California Housing Analysis and Prediction

Mohamed Khazraji
Feb 5, 2022 · 5 min read

In this post, I analyze the California Housing Data (1990). It can be downloaded from Kaggle: https://www.kaggle.com/harrywang/housing

We will predict the median house value in each block.

The data set consists of 20,640 rows and 10 columns:

  1. longitude: A measure of how far west a house is; a higher value is farther west
  2. latitude: A measure of how far north a house is; a higher value is farther north
  3. housing_median_age: Median age of a house within a block; a lower number is a newer building
  4. total_rooms: Total number of rooms within a block
  5. total_bedrooms: Total number of bedrooms within a block
  6. population: Total number of people residing within a block
  7. households: Total number of households (a group of people residing within a home unit) for a block
  8. median_income: Median income for households within a block of houses (measured in tens of thousands of US dollars)
  9. median_house_value: Median house value for households within a block (measured in US dollars)
  10. ocean_proximity: Location of the house with respect to the ocean/sea

median_house_value is our target feature; we will use the other features to predict it.

The task is to predict how much the houses in a particular block cost (the median) based on the block's location and basic sociodemographic data.

We can see that most columns have no NaN values (the exception is total_bedrooms), most features are stored as floats, and only one feature is categorical: ocean_proximity.

There is no obvious reason for some total_bedrooms values to be NaN. The number of NaNs is about 1% of the dataset. We could simply drop these rows or fill them with the mean/median, but let's wait for a while and deal with the blanks in a smarter manner after the initial data analysis.
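
These checks can be reproduced with a few lines of pandas; a minimal sketch, assuming the Kaggle file is saved locally as housing.csv:

```python
import pandas as pd

df = pd.read_csv("housing.csv")

df.info()                                   # dtypes and non-null counts per column
print(df["total_bedrooms"].isna().mean())   # share of missing values, roughly 1%
```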

After that, I plot longitude, latitude, population, and median_house_value. The larger circles represent larger populations, and the red color represents how costly the houses are.
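
A plot along these lines can be produced with pandas and matplotlib; this is a sketch, and the exact styling in the original notebook may differ:

```python
import matplotlib.pyplot as plt

# Map-style scatter plot: circle size ~ population, color ~ median_house_value
df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
        s=df["population"] / 100, label="population",
        c="median_house_value", cmap="jet", colorbar=True, figsize=(10, 7))
plt.legend()
plt.show()
```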

We can see that most of the costlier houses are near the ocean, while the cheaper houses lie farther inland.

Now, let's see the correlation between median_house_value and the other features.

The correlation coefficient always lies between -1 (a strong negative correlation) and 1 (a strong positive correlation); values near 0 indicate little linear relationship.
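
A minimal sketch of computing these correlations (ocean_proximity is still categorical at this point, so we restrict to numeric columns):

```python
# Correlation of every numeric feature with the target, sorted
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```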

median_income has a correlation of about 0.69 with median_house_value. These numbers show how strongly each feature influences the actual median house value; a negative value means the feature moves in the opposite direction to median_house_value, and a value near zero means the two are nearly independent.

So median_income has the strongest effect on median_house_value.

After that, I create three new features; they help later when making predictions with the machine learning algorithms. Let's see the correlation matrix with respect to median_house_value again.
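
The post names bedrooms_per_room and rooms_per_household below; a minimal sketch of this step, assuming the third ratio is population_per_household (a common choice for this dataset):

```python
# Ratios tend to be more informative than raw counts per block
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
# Assumed third feature; the post does not name it explicitly
df["population_per_household"] = df["population"] / df["households"]

print(df.corr(numeric_only=True)["median_house_value"].sort_values(ascending=False))
```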

The new bedrooms_per_room feature is strongly but negatively correlated with median_house_value, so houses with a lower bedroom/room ratio tend to be more expensive.

I plot rooms_per_household against median_house_value, and we can notice that they are somewhat correlated: as the rooms_per_household ratio increases, the median_house_value decreases.

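A sketch of such a plot (the exact styling is an assumption):

```python
import matplotlib.pyplot as plt

# Engineered ratio against the target; low alpha helps with overplotting
df.plot(kind="scatter", x="rooms_per_household", y="median_house_value", alpha=0.2)
plt.show()
```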

Now let's convert the ocean_proximity column, the only categorical column in our data set, to numerical form using dummy variables.

After that, I dropped the missing values, and now the data set is ready for prediction with machine learning algorithms. Both steps are sketched below.
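
A minimal sketch of these two preprocessing steps:

```python
# One-hot encode the only categorical column
df = pd.get_dummies(df, columns=["ocean_proximity"])

# Drop the rows with missing values (the NaNs in total_bedrooms and derived ratios)
df = df.dropna()
```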

First: I divided the data set into features X and target y, and applied a standard scaler to X so all features are on the same scale.

Second: I used 30% of the data for testing and 70% for training and validation.
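
A minimal sketch of these two steps (the random_state is my assumption; note that fitting the scaler on the training set only would avoid leakage, but the sketch follows the post's order of scaling first, then splitting):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["median_house_value"])
y = df["median_house_value"]

# Scale all features to zero mean and unit variance, then split 70/30
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)
```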

I used several machine learning algorithms to make the prediction; comparing the training and test scores tells us whether a model is overfitting or underfitting. The models are listed below, followed by a sketch of the training loop.

LinearRegression

KNeighborsRegressor

RandomForestRegressor

GradientBoostingRegressor
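
A minimal sketch of the training loop with default hyperparameters (the original notebook's settings are not shown):

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

models = [LinearRegression(), KNeighborsRegressor(),
          RandomForestRegressor(random_state=42),
          GradientBoostingRegressor(random_state=42)]

for model in models:
    model.fit(X_train, y_train)
    # .score() returns R^2 for regressors; compare train vs test to spot over/underfitting
    print(type(model).__name__,
          f"train R^2: {model.score(X_train, y_train):.3f}",
          f"test R^2: {model.score(X_test, y_test):.3f}")
```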

GradientBoostingRegressor performed best among the machine learning algorithms I tried.

After that, I tried machine learning algorithms for classification.

I create a new column based on the assumption that I have a budget ($200,000) and want to classify house prices as fitting my budget or not.

fit_with_badget is our target feature.
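
A sketch of creating this label (the column name follows the post; the threshold direction is my assumption):

```python
# 1 if the block's median house value fits a $200,000 budget, 0 otherwise
df["fit_with_badget"] = (df["median_house_value"] <= 200_000).astype(int)
```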

LogisticRegression

We can see from the confusion matrix, precision, recall, and F1 score that the model classifies house prices as fitting the budget or not with very good accuracy.
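
A minimal sketch of this classification step (the hyperparameters and exact feature set are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop both the raw target and the derived label from the features
Xc = df.drop(columns=["median_house_value", "fit_with_badget"])
yc = df["fit_with_badget"]

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    StandardScaler().fit_transform(Xc), yc, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(Xc_train, yc_train)
pred = clf.predict(Xc_test)
print(confusion_matrix(yc_test, pred))
print(classification_report(yc_test, pred))
```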

Conclusions

To sum up, we have got a solution that can predict the median house value in a block. It is not an extremely precise prediction, but it seems close to what is achievable for this class of models on this data (it is a popular dataset, and I have not found any solution with significantly better results).

We have used old Californian data from 1990, so the exact numbers are not useful today. But the same approach can be used to predict modern house prices if applied to recent market data.

We have done a lot, but the results can surely be improved. At the very least, one could try:

  1. feature engineering: polynomial features, better distances to cities (not Euclidean ones, e.g. an ellipse representation of cities), average values of the target for the geographically closest neighbors (requires a custom estimator function for correct cross-validation)
  2. PCA for dimensionality reduction
  3. other models
