Sizwe Mashao

Using Generative AI to analyse rental data



We previously looked at the main factors affecting rental prices in Belgium. In this edition we will do the same, but with a twist: we will let ChatGPT analyse the same dataset and give us the results without any tuning. I will let ChatGPT do all the work to see how good its data analysis capability is. The only thing I will do is upload the data and give it prompts. Without further ado, here we go.


Prompt: You are a data scientist looking at the determinants of rent in Belgium. Use the uploaded dataset to do so.

ChatGPT: Great! You've provided a dataset for analysis. I'll begin by loading the data to understand its structure and the types of variables it contains. This will help in determining the best approach to analyze the determinants of rent in Belgium. Let's get started.

The dataset contains various columns that could be relevant in determining the rent prices in Belgium. Here's a brief overview of the columns:

  1. Price: The rental price (presumably in euros).

  2. District: The district in which the property is located.

  3. Zip: Postal code of the property.

  4. BuildingCondition: The condition of the building (e.g., as new, good).

  5. EPC: Energy performance certificate rating.

  6. SQM: Square meters of the property.

  7. Bedrooms: Number of bedrooms.

  8. Bathrooms: Number of bathrooms.

  9. Is_Furnished: Whether the property is furnished (0 for no, 1 for yes).

  10. Has_Terrace: Whether the property has a terrace (0 for no, 1 for yes).

  11. Has_Garden: Whether the property has a garden (0 for no, 1 for yes).

  12. Has_Laundry: Whether the property has laundry facilities (0 for no, 1 for yes).

  13. Has_Parking: Whether the property has parking (0 for no, 1 for yes).

  14. Has_DoubleGlazing: Whether the property has double glazing (0 for no, 1 for yes).

  15. Nearest_BusStop: Distance to the nearest bus stop (in meters).

  16. Nearest_TramStop: Distance to the nearest tram stop (in meters).

  17. Nearest_TrainStation: Distance to the nearest train station (in meters).

  18. Mobiscore: A score related to the mobility and accessibility of the location.

  19. Density: Population density in the area.

  20. NetIncomePC: Net income per capita in the area.

To analyze the determinants of rent, we can perform a regression analysis where "Price" will be the dependent variable, and the rest of the columns will serve as independent variables. This will help us understand the impact of each variable on rental prices.

Before proceeding with the regression analysis, it's important to check for any missing values and understand the distribution of the data. Let's do that next.

The dataset doesn't have any missing values, which is great for analysis. It contains both numerical and categorical variables. Here's a summary of the data:

  • Numerical variables like Price, SQM (square meters), Bedrooms, Bathrooms, distances to nearest public transport stops, Mobiscore, Density, and NetIncomePC show a wide range of values.

  • Categorical variables include District, BuildingCondition, EPC (Energy Performance Certificate rating), and several binary features indicating the presence of certain amenities (like Is_Furnished, Has_Terrace, Has_Garden, etc.).
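The post does not show the code ChatGPT ran for this check. A minimal sketch of what it might look like, assuming pandas and a tiny made-up table standing in for the uploaded file (only the column names come from the overview above; the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the uploaded file; the real dataset is not shown in the post.
df = pd.DataFrame({
    "Price": [950, 1200, 780, 1500],
    "District": ["Gent", "Brussels", "Antwerp", "Brugge"],
    "SQM": [70, 95, 55, 120],
    "Is_Furnished": [0, 1, 0, 1],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = df.isna().sum()

# Split columns by dtype to see which need encoding and which need scaling.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(df.describe())  # ranges and spread of the numeric columns
print(numeric_columns, categorical_columns)
```

Note that binary flags such as Is_Furnished are stored as 0/1, so pandas treats them as numeric even though they are conceptually categorical.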

Given this mix of numerical and categorical variables, the following steps can be taken for analysis:

  1. Encoding Categorical Variables: Convert categorical variables into a format that can be used for regression analysis. This typically involves one-hot encoding or label encoding.

  2. Scaling Numerical Variables: Since the numerical variables have different scales, it would be beneficial to scale them. This is especially important for distance-related variables and those with a wide range of values.

  3. Regression Analysis: Perform a regression analysis to identify the significant determinants of rent. We can start with a linear regression model and check if the assumptions of linear regression are met.

  4. Model Evaluation: Evaluate the model using appropriate metrics like R-squared, RMSE (Root Mean Square Error), etc.
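The four steps above can be sketched in a few lines. This is not the code ChatGPT executed (the post does not show it); it is a small illustration using NumPy only, on synthetic data with made-up effects, and it evaluates the fit on the same rows it was trained on rather than on a held-out set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Belgian dataset): one categorical, one numeric feature.
n = 200
districts = rng.choice(np.array(["Gent", "Brussels", "Antwerp"]), size=n)
sqm = rng.uniform(30, 150, size=n)
district_effect = {"Gent": 100.0, "Brussels": 250.0, "Antwerp": 0.0}  # assumed values
price = (400 + 8 * sqm
         + np.array([district_effect[d] for d in districts])
         + rng.normal(0, 50, size=n))

# Step 1: one-hot encode the district, dropping the first category as the baseline.
categories = np.unique(districts)  # sorted: Antwerp, Brussels, Gent
one_hot = (districts[:, None] == categories[None, 1:]).astype(float)

# Step 2: standardise the numeric column to zero mean and unit variance.
sqm_scaled = (sqm - sqm.mean()) / sqm.std()

# Step 3: ordinary least squares with an explicit intercept column.
X = np.column_stack([np.ones(n), one_hot, sqm_scaled])
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)

# Step 4: evaluate the fit (in-sample here, for brevity).
residuals = price - X @ coefficients
r_squared = 1 - np.sum(residuals**2) / np.sum((price - price.mean())**2)
rmse = np.sqrt(np.mean(residuals**2))

print(f"R-squared: {r_squared:.3f}, RMSE: {rmse:.2f}")
```

Dropping one district as the baseline is what makes statements like "lower rent compared to the baseline district" meaningful: every district coefficient is measured relative to the omitted one.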

Shall I proceed with these steps and perform the regression analysis?


Prompt: Yes, please proceed with the analysis


ChatGPT:


  1. R-squared (R²): Approximately 0.636. This indicates that about 63.6% of the variance in the rent prices can be explained by the model. An R² of 0.636 is relatively decent for real-world data, but it also suggests that there are other factors not captured by the model that influence rent prices.

  2. Root Mean Square Error (RMSE): Approximately 430.44 euros. This value represents the standard deviation of the residuals (prediction errors). It means that the model's predictions are, on average, about 430.44 euros off from the actual rent prices.

These metrics suggest that the model has a moderate level of predictive ability. To improve the model, you could consider adding more relevant features, using more complex modeling techniques, or tuning the model's parameters.
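To make the two metrics concrete, here is a small worked example on four made-up rents (not values from the dataset). It also computes the mean absolute error for comparison, since RMSE squares the errors before averaging and is therefore not quite the same thing as the average size of a miss:

```python
import numpy as np

# Made-up rents in euros, purely for illustration.
actual = np.array([1000.0, 1200.0, 800.0, 1500.0])
predicted = np.array([900.0, 1300.0, 850.0, 1400.0])

residuals = actual - predicted

# R-squared: share of the variance in actual rents that the predictions account for.
ss_res = np.sum(residuals**2)
ss_tot = np.sum((actual - actual.mean())**2)
r_squared = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared residual, so large misses weigh more.
rmse = np.sqrt(np.mean(residuals**2))

# Mean absolute error: the plain average size of a miss.
mae = np.mean(np.abs(residuals))

print(f"R2={r_squared:.3f} RMSE={rmse:.2f} MAE={mae:.2f}")
# prints: R2=0.879 RMSE=90.14 MAE=87.50
```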

Would you like to proceed with a more detailed analysis, such as examining the specific coefficients of the model to understand the impact of each variable?


Prompt: Summarise the model for me. Explain the main coefficients


ChatGPT: To summarize the model and explain the main coefficients, I will retrieve the coefficients from the linear regression model and match them to their respective features. This will help us understand the influence of each variable on the rental price.

Since we used one-hot encoding for categorical variables, there will be multiple coefficients for each of these variables, corresponding to their categories. For numerical variables, there will be a single coefficient for each.

Let's extract and interpret these coefficients.

Here's a summary of the model with an emphasis on the top 10 coefficients:

  1. District_Brugge: A coefficient of +455.86, suggesting that properties in the Brugge district tend to have higher rent prices, all else being equal.

  2. Zip: The coefficient of -436.58 indicates that higher postal code numbers are associated with lower rent prices. This could reflect geographic or socio-economic trends within Belgium.

  3. District_Gent: A coefficient of +431.48, similar to Brugge, indicating that properties in the Gent district also command higher rents.

  4. District_Antwerp: This district has a coefficient of -424.71, suggesting lower rent prices compared to the baseline district.

  5. SQM (Square Meters): A positive coefficient of +287.72, indicating that larger properties (in terms of square meters) are associated with higher rent prices.

  6. District_Brussels: This district has a negative coefficient of -281.25, suggesting lower rents compared to the baseline.

  7. District_Leuven: A negative coefficient of -181.38, indicating lower rent prices in this district.

  8. Bathrooms: Each additional bathroom is associated with an increase in rent price, as indicated by a coefficient of +156.98.

  9. Nearest_TramStop: A negative coefficient of -125.62, implying that properties further away from tram stops might have higher rent, possibly reflecting a preference for less busy areas.

  10. Is_Furnished: Properties that are furnished have a positive coefficient of +98.79, suggesting that furnished properties command higher rents.

These coefficients provide insights into how various features impact rental prices in Belgium. Districts play a significant role, as do property size and amenities. It's important to note that these relationships are correlations and do not imply causation. Also, the model's accuracy and the relevance of these coefficients should be considered in the context of the model's overall R-squared and RMSE values.
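A "top 10 coefficients" list like the one above is typically produced by pairing each feature name with its fitted coefficient and sorting by absolute value. The sketch below shows the idea on synthetic data with assumed effects (the feature names are borrowed from the dataset, the numbers are not). Two reading rules apply to any linear model: the sign gives the direction, so a negative coefficient on a distance feature means the predicted rent falls as the distance grows, and if the features were standardised, each coefficient is the change per standard deviation of the feature rather than per unit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic, roughly standardised features (NOT the Belgian dataset).
feature_names = ["SQM", "Bathrooms", "Nearest_TramStop"]
X_scaled = rng.normal(size=(n, 3))

# Assumed ground truth: size and bathrooms raise rent, distance to a tram stop lowers it.
true_effects = np.array([300.0, 150.0, -75.0])
price = 1100 + X_scaled @ true_effects + rng.normal(0, 100, size=n)

# Fit ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), X_scaled])
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)

# Pair each feature with its coefficient (skipping the intercept) and rank by magnitude.
ranked = sorted(
    zip(feature_names, coefficients[1:]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```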


Prompt: Thank you.


Well, there you have it: a decent regression model created with just a few prompts. While this is far from perfect, it is a good start. It will not replace a seasoned data scientist, but with the right usage it can be a powerful ally.
