# 📃 Solution for Exercise M4.04¶

In the previous notebook, we saw the effect of applying some regularization on the coefficient of a linear model.

In this exercise, we will study the advantage of using some regularization when dealing with correlated features.

We will first create a regression dataset. This dataset will contain 2,000 samples and 5 features from which only 2 features will be informative.

```from sklearn.datasets import make_regression

data, target, coef = make_regression(
n_samples=2_000,
n_features=5,
n_informative=2,
shuffle=False,
coef=True,
random_state=0,
noise=30,
)
```

When creating the dataset, `make_regression` returns the true coefficient used to generate the dataset. Let’s plot this information.

```import pandas as pd

feature_names = [f"Features {i}" for i in range(data.shape)]
coef = pd.Series(coef, index=feature_names)
coef.plot.barh()
coef
```
```Features 0     9.566665
Features 1    40.192077
Features 2     0.000000
Features 3     0.000000
Features 4     0.000000
dtype: float64
``` Create a `LinearRegression` regressor and fit on the entire dataset and check the value of the coefficients. Are the coefficients of the linear regressor close to the coefficients used to generate the dataset?

```# solution
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_
```
```array([10.89587004, 40.41128042, -0.20542454, -0.18954462,  0.11129768])
```
```feature_names = [f"Features {i}" for i in range(data.shape)]
coef = pd.Series(linear_regression.coef_, index=feature_names)
_ = coef.plot.barh()
``` We see that the coefficients are close to the coefficients used to generate the dataset. The dispersion is indeed cause by the noise injected during the dataset generation.

Now, create a new dataset that will be the same as `data` with 4 additional columns that will repeat twice features 0 and 1. This procedure will create perfectly correlated features.

```# solution
import numpy as np

data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)
```

Fit again the linear regressor on this new dataset and check the coefficients. What do you observe?

```# solution
linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_
```
```array([ 1.33594010e+12, -1.62497905e+14, -2.05078125e-01, -1.77612305e-01,
9.71679688e-02, -6.67970049e+11,  4.20600332e+13, -6.67970049e+11,
1.20437872e+14])
```
```feature_names = [f"Features {i}" for i in range(data.shape)]
coef = pd.Series(linear_regression.coef_, index=feature_names)
_ = coef.plot.barh()
``` We see that the coefficient values are far from what one could expect. By repeating the informative features, one would have expected these coefficients to be similarly informative.

Instead, we see that some coefficients have a huge norm ~1e14. It indeed means that we try to solve an mathematical ill-posed problem. Indeed, finding coefficients in a linear regression involves inverting the matrix `np.dot(data.T, data)` which is not possible (or lead to high numerical errors).

Create a ridge regressor and fit on the same dataset. Check the coefficients. What do you observe?

```# solution
from sklearn.linear_model import Ridge

ridge = Ridge()
ridge.fit(data, target)
ridge.coef_
```
```array([ 3.6313933 , 13.46802113, -0.20549345, -0.18929961,  0.11117205,
3.6313933 , 13.46802113,  3.6313933 , 13.46802113])
```
```coef = pd.Series(ridge.coef_, index=feature_names)
_ = coef.plot.barh()
``` We see that the penalty applied on the weights give a better results: the values of the coefficients do not suffer from numerical issues. Indeed, the matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`. Adding this penalty `alpha` allow the inversion without numerical issue.

Can you find the relationship between the ridge coefficients and the original coefficients?

```# solution
ridge.coef_[:5] * 3
```
```array([10.89417991, 40.40406338, -0.61648035, -0.56789883,  0.33351616])
```

Repeating three times each informative features induced to divide the ridge coefficients by three.

Tip

We always advise to use l2-penalized model instead of non-penalized model in practice. In scikit-learn, `LogisticRegression` applies such penalty by default. However, one needs to use `Ridge` (and even `RidgeCV` to tune the parameter `alpha`) instead of `LinearRegression`.