Regularization of linear regression model

In this notebook, we will see the limitations of linear regression models and the advantage of using regularized models instead.

In addition, we will present the preprocessing required when dealing with regularized models, especially when the regularization parameter needs to be tuned.

We will start by highlighting the over-fitting issue that can arise with a simple linear regression model.

Effect of regularization

We will first load the California housing dataset.

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

from sklearn.datasets import fetch_california_housing

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25

In one of the previous notebooks, we showed that linear models could be used even in settings where data and target are not linearly related.

We showed that one can use the PolynomialFeatures transformer to create additional features encoding non-linear interactions between features.

Here, we will use this transformer to augment the feature space and then train a linear regression model. We will use cross-validation and its out-of-sample test sets to evaluate the generalization capabilities of our model.

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

linear_regression = make_pipeline(PolynomialFeatures(degree=2),
                                  LinearRegression())
cv_results = cross_validate(linear_regression, data, target,
                            cv=10, scoring="neg_mean_squared_error",
                            return_train_score=True,
                            return_estimator=True)

We can compare the mean squared error on the training and testing set to assess the generalization performance of our model.

train_error = -cv_results["train_score"]
print(f"Mean squared error of linear regression model on the train set:\n"
      f"{train_error.mean():.3f} +/- {train_error.std():.3f}")
Mean squared error of linear regression model on the train set:
4190.212 +/- 151.123
test_error = -cv_results["test_score"]
print(f"Mean squared error of linear regression model on the test set:\n"
      f"{test_error.mean():.3f} +/- {test_error.std():.3f}")
Mean squared error of linear regression model on the test set:
13334.943 +/- 20292.681

The error on the training set is much lower. This gap in statistical performance between the training and testing scores is an indication that our model overfitted the training set.

Indeed, this is one of the dangers of augmenting the number of features with a PolynomialFeatures transformer: the model may rely heavily on a few specific features. We can confirm this by checking the weights of the model. Let’s create a dataframe: the columns will contain the names of the features, while each row will contain the coefficient values stored by one model during the cross-validation.
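To get a sense of how quickly PolynomialFeatures grows the feature space, we can count the generated features (a quick sketch, not in the original notebook, reusing the data loaded above; n_output_features_ is the attribute of the fitted transformer holding this count):

check_poly = PolynomialFeatures(degree=2).fit(data)
# 1 bias + 8 original + 8 squared + 28 interaction terms = 45 features
print(check_poly.n_output_features_)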

Since we used a PolynomialFeatures to augment the data, we will create feature names representative of the feature combinations. Scikit-learn provides a get_feature_names method for this purpose. First, let’s get the first fitted model from the cross-validation.

model_first_fold = cv_results["estimator"][0]

Now, we can access the fitted PolynomialFeatures to generate the feature names:

feature_names = model_first_fold[0].get_feature_names(
    input_features=data.columns)
feature_names
['1',
 'MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude',
 'MedInc^2',
 'MedInc HouseAge',
 'MedInc AveRooms',
 'MedInc AveBedrms',
 'MedInc Population',
 'MedInc AveOccup',
 'MedInc Latitude',
 'MedInc Longitude',
 'HouseAge^2',
 'HouseAge AveRooms',
 'HouseAge AveBedrms',
 'HouseAge Population',
 'HouseAge AveOccup',
 'HouseAge Latitude',
 'HouseAge Longitude',
 'AveRooms^2',
 'AveRooms AveBedrms',
 'AveRooms Population',
 'AveRooms AveOccup',
 'AveRooms Latitude',
 'AveRooms Longitude',
 'AveBedrms^2',
 'AveBedrms Population',
 'AveBedrms AveOccup',
 'AveBedrms Latitude',
 'AveBedrms Longitude',
 'Population^2',
 'Population AveOccup',
 'Population Latitude',
 'Population Longitude',
 'AveOccup^2',
 'AveOccup Latitude',
 'AveOccup Longitude',
 'Latitude^2',
 'Latitude Longitude',
 'Longitude^2']

Finally, we can create the dataframe containing all the information.

import pandas as pd

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
weights_linear_regression = pd.DataFrame(coefs, columns=feature_names)

Now, let’s use a box plot to see the variations of the coefficients.

import matplotlib.pyplot as plt

color = {"whiskers": "black", "medians": "black", "caps": "black"}
weights_linear_regression.plot.box(color=color, vert=False, figsize=(6, 16))
_ = plt.title("Linear regression coefficients")
[Figure: box plot of the linear regression coefficients across cross-validation folds]

We can force the linear regression model to consider all features in a more homogeneous manner. In practice, we can force large positive or negative weights to shrink toward zero. This is known as regularization. We will use a ridge model, which enforces such behavior.
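For reference (a standard formulation, not spelled out in the original notebook), ridge regression finds the weights w that minimize a penalized least-squares objective:

$$\min_{w} \; \|y - X w\|_2^2 + \alpha \, \|w\|_2^2$$

The second term penalizes large weights, and the parameter alpha controls the strength of this penalty.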

from sklearn.linear_model import Ridge

ridge = make_pipeline(PolynomialFeatures(degree=2),
                      Ridge(alpha=100))
cv_results = cross_validate(ridge, data, target,
                            cv=10, scoring="neg_mean_squared_error",
                            return_train_score=True,
                            return_estimator=True)
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/linear_model/_ridge.py:148: LinAlgWarning: Ill-conditioned matrix (rcond=2.672e-17): result may not be accurate.
  overwrite_a=True).T
(the same LinAlgWarning is raised for each of the 10 cross-validation folds)
train_error = -cv_results["train_score"]
print(f"Mean squared error of ridge model on the train set:\n"
      f"{train_error.mean():.3f} +/- {train_error.std():.3f}")
Mean squared error of ridge model on the train set:
4373.180 +/- 153.942
test_error = -cv_results["test_score"]
print(f"Mean squared error of ridge model on the test set:\n"
      f"{test_error.mean():.3f} +/- {test_error.std():.3f}")
Mean squared error of ridge model on the test set:
7303.589 +/- 4950.732

We see that the training and testing scores are much closer, indicating that our model is overfitting less. We can compare the values of the ridge weights with those of the unregularized linear regression.

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
weights_ridge = pd.DataFrame(coefs, columns=feature_names)
weights_ridge.plot.box(color=color, vert=False, figsize=(6, 16))
_ = plt.title("Ridge weights")
[Figure: box plot of the ridge weights across cross-validation folds]

Comparing this plot with the previous one, we see that the magnitudes of the weights are shrunk towards zero in comparison with the linear regression model.
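As a quick numeric check (a sketch, not part of the original analysis), we can compare the mean absolute weight of both models using the two dataframes built above:

print("Mean absolute weight, linear regression: "
      f"{weights_linear_regression.abs().mean().mean():.2f}")
print("Mean absolute weight, ridge:             "
      f"{weights_ridge.abs().mean().mean():.2f}")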

However, in this example, we omitted two important aspects: (i) the need to scale the data and (ii) the need to search for the best regularization parameter.

Scale your data!

Regularization adds constraints on the weights of the model. We saw in the previous example that a ridge model enforces that all weights have a similar magnitude: the larger alpha is, the stronger this enforcement.
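To illustrate this effect, here is a small sketch (not part of the original analysis; it refits the unscaled pipeline, so the same ill-conditioning warnings as above may appear) that fits the pipeline for a few values of alpha and reports the mean absolute coefficient:

for alpha in [0.1, 10, 1_000]:
    model = make_pipeline(PolynomialFeatures(degree=2),
                          Ridge(alpha=alpha)).fit(data, target)
    # larger alpha -> coefficients shrunk closer to zero
    print(f"alpha={alpha:>7}: mean |coef| = {abs(model[-1].coef_).mean():.3f}")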

This procedure should make us think about feature rescaling. Let’s consider the case where features have identical data dispersion: if two features are found equally important by the model, they will be affected similarly by the regularization strength.

Now, let’s consider the scenario where features have completely different data dispersion (for instance age in years and annual revenue in dollars). If two features are equally important, our model will boost the weights of features with small dispersion and reduce the weights of features with high dispersion.
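A small illustration of this scenario on synthetic data (a sketch with made-up data, not part of the original notebook): two equally informative features, one of which is expressed in units 1000 times smaller. Without rescaling, the small-dispersion feature needs a very large weight to contribute, and the ridge penalty crushes exactly that weight:

import numpy as np

rng = np.random.RandomState(0)
raw = rng.normal(size=(1_000, 2))
X_toy = raw * [1, 1e-3]   # second feature expressed in 1000x smaller units
y_toy = raw @ [1.0, 1.0]  # both underlying signals are equally important
# the unpenalized fit needs a ~1000x larger weight for the small-scale feature
print("ordinary least squares:", LinearRegression().fit(X_toy, y_toy).coef_)
# ridge penalizes that large weight heavily
print("ridge, alpha=1:        ", Ridge(alpha=1).fit(X_toy, y_toy).coef_)

Even though the two ridge coefficients end up with similar values, the second one multiplies values that are 1000 times smaller, so the contribution of this feature to the prediction is essentially erased.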

We recall that regularization forces weights to be closer to each other. Therefore, we get the intuition that, if we want to use regularization, working with rescaled data makes it easier to find an optimal regularization parameter and thus an adequate model.

As a side note, some solvers based on gradient computation expect such rescaled data. Unscaled data can be detrimental when computing the optimal weights. Therefore, when working with a linear model and numerical data, it is generally good practice to scale the data.

Thus, we will add a StandardScaler in the machine learning pipeline. This scaler will be placed just before the regressor.

from sklearn.preprocessing import StandardScaler

ridge = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(),
                      Ridge(alpha=0.5))
cv_results = cross_validate(ridge, data, target,
                            cv=10, scoring="neg_mean_squared_error",
                            return_train_score=True,
                            return_estimator=True)
train_error = -cv_results["train_score"]
print(f"Mean squared error of ridge model on the train set:\n"
      f"{train_error.mean():.3f} +/- {train_error.std():.3f}")
Mean squared error of ridge model on the train set:
4347.036 +/- 156.666
test_error = -cv_results["test_score"]
print(f"Mean squared error of ridge model on the test set:\n"
      f"{test_error.mean():.3f} +/- {test_error.std():.3f}")
Mean squared error of ridge model on the test set:
5508.472 +/- 1816.642

We observe that scaling the data has a positive impact on the test score and that the test score is closer to the train score. It means that our model overfits less and that we are getting closer to the generalization sweet spot.

Let’s have an additional look at the different weights.

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
weights_ridge = pd.DataFrame(coefs, columns=feature_names)
weights_ridge.plot.box(color=color, vert=False, figsize=(6, 16))
_ = plt.title("Ridge weights with data scaling")
[Figure: box plot of the ridge weights with data scaling]

Compared to the previous plots, we see that the weight magnitudes are now closer to each other and that all weights contribute more equally.

In the previous analysis, we did not study whether the parameter alpha has an effect on the performance. We chose this parameter beforehand and fixed it for the analysis.

In the next section, we will check the impact of this hyperparameter and how it should be tuned.

Fine tuning the regularization parameter

As mentioned, the regularization parameter needs to be tuned for each dataset: the default value is unlikely to lead to the optimal model. Therefore, we need to tune the alpha parameter.

Model hyperparameter tuning should be done with care. Indeed, we want to find an optimal parameter that maximizes some metric. Thus, it requires both a training set and a testing set.

However, this testing set should be different from the out-of-sample testing set that we use to evaluate our model: if we use the same one, we would be using an alpha that was optimized for this testing set, which breaks the out-of-sample rule.

Therefore, we should include the search for the hyperparameter alpha within the cross-validation. As we saw in previous notebooks, we could use a grid-search. However, some predictors in scikit-learn come with an integrated hyperparameter search that is more efficient than a grid-search. The names of these predictors end with CV. In the case of Ridge, scikit-learn provides a RidgeCV regressor.
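For comparison, here is a sketch of what the equivalent search would look like with a plain grid-search (it is not used in this notebook; the pipeline and alpha grid mirror the RidgeCV version below):

import numpy as np
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": np.logspace(-2, 0, num=20)},
    scoring="neg_mean_squared_error",
)
# grid_search could then be passed to cross_validate in place of the pipeline,
# giving the same kind of nested cross-validation at a higher computational cost.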

Therefore, we can use this predictor as the last step of the pipeline. Wrapping the pipeline in a cross-validation results in a nested cross-validation: the inner cross-validation searches for the best alpha, while the outer cross-validation gives an estimate of the testing score.

import numpy as np
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-2, 0, num=20)
ridge = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(),
                      RidgeCV(alphas=alphas, store_cv_values=True))
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=5, random_state=1)
cv_results = cross_validate(ridge, data, target,
                            cv=cv, scoring="neg_mean_squared_error",
                            return_train_score=True,
                            return_estimator=True, n_jobs=2)
train_error = -cv_results["train_score"]
print(f"Mean squared error of ridge model on the train set:\n"
      f"{train_error.mean():.3f} +/- {train_error.std():.3f}")
Mean squared error of ridge model on the train set:
4306.562 +/- 25.918
test_error = -cv_results["test_score"]
print(f"Mean squared error of ridge model on the test set:\n"
      f"{test_error.mean():.3f} +/- {test_error.std():.3f}")
Mean squared error of ridge model on the test set:
4348.657 +/- 252.921

By optimizing alpha, we see that the training and testing scores are close. It indicates that our model is not overfitting.

When fitting the ridge regressor, we also requested to store the errors found during its internal cross-validation (by setting the parameter store_cv_values=True). We will plot the mean squared error for the different alpha regularization strengths that we tried.

mse_alphas = [est[-1].cv_values_.mean(axis=0)
              for est in cv_results["estimator"]]
cv_alphas = pd.DataFrame(mse_alphas, columns=alphas)
cv_alphas
0.010000 0.012743 0.016238 0.020691 0.026367 0.033598 0.042813 0.054556 0.069519 0.088587 0.112884 0.143845 0.183298 0.233572 0.297635 0.379269 0.483293 0.615848 0.784760 1.000000
0 7587.897141 7059.531216 6579.796667 6161.839629 5813.048345 5535.350138 5326.646369 5182.950517 5100.749049 5079.212663 5122.029454 5238.704149 5445.118890 5763.012259 6217.925433 6835.274126 7634.692137 8623.620241 9791.918443 11109.476019
1 7079.341771 6696.622046 6329.022308 5987.727181 5681.306000 5415.171818 5191.880969 5012.209177 4876.780585 4787.974274 4751.851652 4779.853646 4889.937328 5106.656625 5459.549144 5979.280717 6691.530974 7609.581815 8727.609953 10017.092899
2 24857.406605 21448.284772 18293.478416 15481.009011 13067.031991 11071.457116 9480.437579 8254.431555 7338.963939 6675.026438 6207.024331 5887.541169 5679.263698 5554.913823 5496.051755 5491.367683 5534.775517 5623.398006 5755.456140 5928.154410
3 7504.216958 7125.074257 6754.172985 6400.999897 6072.937966 5774.941819 5509.955509 5280.027720 5087.960263 4939.271829 4844.228092 4819.618117 4889.808854 5086.420585 5445.877349 6004.320850 6790.113763 7815.311733 9068.542918 10511.939341
4 6999.938808 6598.898743 6215.363499 5861.328579 5546.025773 5275.377052 5052.349007 4878.140850 4753.964890 4683.117084 4673.049960 4737.162496 4895.935124 5176.863333 5612.500529 6236.037942 7074.449338 8140.303978 9424.471281 10892.352852
cv_alphas.mean(axis=0).plot(marker="+")
plt.ylabel("Mean squared error\n (lower is better)")
plt.xlabel("alpha")
_ = plt.title("Error obtained by cross-validation")
[Figure: mean squared error obtained by cross-validation as a function of alpha]

As we can see, regularization is just like salt in cooking: one must balance its amount to get the best statistical performance. We can check whether the best alpha found is stable across the cross-validation folds.

best_alphas = [est[-1].alpha_ for est in cv_results["estimator"]]
best_alphas
[0.08858667904100823,
 0.11288378916846889,
 0.37926901907322497,
 0.14384498882876628,
 0.11288378916846889]

In this notebook, you learned about the concept of regularization and the importance of preprocessing and parameter tuning.