πŸ“ƒ Solution for Exercise M4.04

In the previous notebook, we saw the effect of applying some regularization on the coefficients of a linear model.

In this exercise, we will study the advantage of using some regularization when dealing with correlated features.

We will first create a regression dataset. This dataset will contain 2,000 samples and 5 features, of which only 2 are informative.

from sklearn.datasets import make_regression

data, target, coef = make_regression(
    n_samples=2_000,
    n_features=5,
    n_informative=2,
    shuffle=False,
    coef=True,
    random_state=0,
    noise=30,
)

When creating the dataset, make_regression returns the true coefficients used to generate the dataset. Let’s plot this information.

import pandas as pd

feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(coef, index=feature_names)
coef.plot.barh()
coef
Features 0     9.566665
Features 1    40.192077
Features 2     0.000000
Features 3     0.000000
Features 4     0.000000
dtype: float64
[Figure: horizontal bar plot of the true coefficients used to generate the dataset]

Create a LinearRegression regressor, fit it on the entire dataset, and check the value of the coefficients. Are the coefficients of the linear regressor close to the coefficients used to generate the dataset?

from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_
array([10.89587004, 40.41128042, -0.20542454, -0.18954462,  0.11129768])
feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(linear_regression.coef_, index=feature_names)
_ = coef.plot.barh()
[Figure: horizontal bar plot of the LinearRegression coefficients]

We see that the coefficients are close to the coefficients used to generate the dataset. The dispersion is indeed caused by the noise injected during the dataset generation.

Now, create a new dataset that is the same as data, with 4 additional columns that repeat features 0 and 1 twice. This procedure creates perfectly correlated features.

import numpy as np

data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)

Fit the linear regressor again on this new dataset and check the coefficients. What do you observe?

linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_
array([ 1.33594010e+12, -1.62497905e+14, -2.05078125e-01, -1.77612305e-01,
        9.71679688e-02, -6.67970049e+11,  4.20600332e+13, -6.67970049e+11,
        1.20437872e+14])
feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(linear_regression.coef_, index=feature_names)
_ = coef.plot.barh()
[Figure: horizontal bar plot of the LinearRegression coefficients on the dataset with repeated features]

We see that the coefficient values are far from what one could expect. By repeating the informative features, one would have expected these coefficients to be similarly informative.

Instead, we see that some coefficients have a huge norm (~1e14). This means that we are trying to solve a mathematically ill-posed problem: finding the coefficients of a linear regression involves inverting the matrix np.dot(data.T, data), which is not possible here (or leads to large numerical errors) because the repeated columns make this matrix singular.
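To make this concrete (a quick check that is not part of the original solution), we can look at the Gram matrix np.dot(data.T, data) that appears in the normal equations: because of the repeated columns, it is rank-deficient and therefore numerically singular.

# Gram matrix of the augmented dataset: 9 columns but only 5 are distinct,
# so the matrix is rank-deficient and its condition number blows up.
gram = data.T @ data
print("shape:", gram.shape)
print("rank:", np.linalg.matrix_rank(gram))
print("condition number:", np.linalg.cond(gram))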

Create a ridge regressor and fit on the same dataset. Check the coefficients. What do you observe?

from sklearn.linear_model import Ridge

ridge = Ridge()
ridge.fit(data, target)
ridge.coef_
array([ 3.6313933 , 13.46802113, -0.20549345, -0.18929961,  0.11117205,
        3.6313933 , 13.46802113,  3.6313933 , 13.46802113])
coef = pd.Series(ridge.coef_, index=feature_names)
_ = coef.plot.barh()
[Figure: horizontal bar plot of the Ridge coefficients]

We see that the penalty applied to the weights gives better results: the values of the coefficients no longer suffer from numerical issues. Indeed, the matrix to be inverted internally is np.dot(data.T, data) + alpha * I. Adding the penalty alpha allows the inversion without numerical issues.
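As a rough numerical check (not part of the original solution), we can compute the condition number of the regularized matrix using Ridge's default alpha=1.0: it is now finite and much smaller, which explains why the solution is stable.

# Adding alpha * I to the Gram matrix shifts all eigenvalues away from zero,
# so the regularized matrix is invertible and well conditioned.
alpha = 1.0  # default regularization strength of Ridge
regularized_gram = data.T @ data + alpha * np.eye(data.shape[1])
print("condition number:", np.linalg.cond(regularized_gram))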

Can you find the relationship between the ridge coefficients and the original coefficients?

ridge.coef_[:5] * 3
array([10.89417991, 40.40406338, -0.61648035, -0.56789883,  0.33351616])

Repeating each informative feature three times divided the corresponding ridge coefficients by three.
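Equivalently (an extra check, not in the original solution), summing the coefficients of the three copies of a repeated feature recovers, approximately, the coefficient found earlier by the unpenalized model:

# Feature 0 is stored in columns 0, 5 and 7 of the augmented dataset; the sum
# of its three ridge coefficients is close to the single LinearRegression
# coefficient (~10.9) obtained before duplicating the columns.
ridge.coef_[0] + ridge.coef_[5] + ridge.coef_[7]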

Tip

In practice, we always advise using an l2-penalized model instead of a non-penalized model. In scikit-learn, LogisticRegression applies such a penalty by default. However, for regression one needs to use Ridge (or even RidgeCV to tune the parameter alpha) instead of LinearRegression.
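As an illustration of this tip, here is a minimal sketch of tuning alpha with RidgeCV; the grid of candidate values below is an arbitrary choice for this example.

from sklearn.linear_model import RidgeCV

# Try a log-spaced grid of penalties and let RidgeCV pick the best one
# using its internal (efficient leave-one-out) cross-validation.
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, num=13))
ridge_cv.fit(data, target)
ridge_cv.alpha_  # the selected regularization strength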