📝 Exercise M4.03

In all previous notebooks, we used only a single feature of the data. We have already shown, however, that we can make a model more expressive by deriving new features from the original one.

The aim of this notebook is to train a linear regression model on a dataset with more than a single feature.

We will load a dataset about house prices in California. It consists of 8 features describing the demography and geography of California districts, and the target is the median house price of each district. We will use all 8 features to predict this target.

Note

If you want a deeper overview of this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

from sklearn.datasets import fetch_california_housing

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target to be expressed in thousands of dollars (k$)
data.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25

Now it is your turn to train a linear regression model on this dataset. First, create a linear regression model.

# Write your code here.
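One possible sketch of this step, assuming scikit-learn's LinearRegression with its default parameters (the name model is an assumption reused in the sketches below):

from sklearn.linear_model import LinearRegression

# A plain ordinary least squares model, with default parameters.
model = LinearRegression()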

Execute a cross-validation with 10 folds and use the mean absolute error (MAE) as the metric. Be sure to return the fitted estimators.

# Write your code here.
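A possible sketch, assuming sklearn.model_selection.cross_validate together with the model, data and target variables defined above; the result is stored in a dictionary named cv_results (an assumed name reused in the next steps):

from sklearn.model_selection import cross_validate

# 10-fold cross-validation scored with the negative MAE
# (scikit-learn's scoring convention is "greater is better", hence the sign),
# keeping the estimator fitted on each fold via return_estimator=True.
cv_results = cross_validate(
    model,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
    return_estimator=True,
)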

Compute the mean and standard deviation of the MAE in thousands of dollars (k$).

# Write your code here.
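A possible sketch, assuming the cv_results dictionary from the previous step; the scores are negated to recover the MAE, which is already in k$ because the target was rescaled when loading the data:

# "test_score" holds one negative MAE value per fold; flip the sign.
# The target is expressed in k$, so the errors are as well.
mae = -cv_results["test_score"]
print(f"Mean absolute error: {mae.mean():.3f} ± {mae.std():.3f} k$")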

Inspect the fitted models using a box plot to show the distribution of values for the coefficients returned from the cross-validation. Hint: use the function df.plot.box() to create a box plot.

# Write your code here.
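One possible sketch, assuming the fitted estimators stored under the "estimator" key of cv_results and using the column names of data to label the coefficients:

import pandas as pd

# One row per cross-validation fold, one column per feature.
weights = pd.DataFrame(
    [est.coef_ for est in cv_results["estimator"]],
    columns=data.columns,
)
weights.plot.box()

In such a plot, a wide box for a coefficient suggests that its value varies noticeably from one fold to another.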