# 📝 Exercise M4.04

In the previous notebook, we saw the effect of applying regularization on the coefficients of a linear model.

In this exercise, we will study the advantage of using some regularization when dealing with correlated features.

We will first create a regression dataset. This dataset will contain 2,000 samples and 5 features, of which only 2 are informative.

```
from sklearn.datasets import make_regression

data, target, coef = make_regression(
    n_samples=2_000,
    n_features=5,
    n_informative=2,
    shuffle=False,
    coef=True,
    random_state=0,
    noise=30,
)
```

When creating the dataset, `make_regression` returns the true coefficients used to generate the dataset. Let’s plot this information.

```
import pandas as pd

feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(coef, index=feature_names)
coef.plot.barh()
coef
```
```
Features 0     9.566665
Features 1    40.192077
Features 2     0.000000
Features 3     0.000000
Features 4     0.000000
dtype: float64
```

Create a `LinearRegression` regressor, fit it on the entire dataset, and check the values of the coefficients. Are the coefficients of the linear regressor close to the coefficients used to generate the dataset?

```
# Write your code here.
```
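
For reference, here is one possible sketch (not the official solution; the name `linear_regression` is our own choice). It fits the model and wraps its `coef_` attribute in a `pandas.Series` for easy comparison:

```
from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares model on the full dataset.
linear_regression = LinearRegression()
linear_regression.fit(data, target)

# Wrap the fitted coefficients in a Series to compare them with `coef`.
pd.Series(linear_regression.coef_, index=feature_names)
```

With 2,000 samples and moderate noise, the fitted coefficients should land close to the true generative coefficients.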

Now, create a new dataset that is identical to `data` but with 4 additional columns that repeat features 0 and 1 twice each. This procedure creates perfectly correlated features.

```
# Write your code here.
```
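
One way to build such a dataset is to stack two extra copies of columns 0 and 1 with NumPy (the name `data_repeated` is an assumption, not imposed by the exercise):

```
import numpy as np

# Append two copies of features 0 and 1: 5 + 4 = 9 columns in total.
# Columns 5 and 7 duplicate feature 0; columns 6 and 8 duplicate feature 1.
data_repeated = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)
data_repeated.shape
```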

Fit the linear regressor again on this new dataset and check the coefficients. What do you observe?

```
# Write your code here.
```
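
A sketch of this step, reusing the hypothetical `data_repeated` defined above:

```
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(data_repeated, target)

# Label all 9 columns so the coefficients are easy to read.
repeated_feature_names = [f"Features {i}" for i in range(data_repeated.shape[1])]
pd.Series(linear_regression.coef_, index=repeated_feature_names)
```

Because the duplicated columns are perfectly correlated, ordinary least squares can split a feature's weight arbitrarily among its copies, so the individual coefficients become unstable and no longer match the generative ones.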

Create a ridge regressor and fit it on the same dataset. Check the coefficients. What do you observe?

```
# Write your code here.
```
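
A minimal sketch using `Ridge` with its default regularization strength (`alpha=1.0`), again assuming the `data_repeated` and `repeated_feature_names` names from above:

```
from sklearn.linear_model import Ridge

ridge = Ridge()
ridge.fit(data_repeated, target)

pd.Series(ridge.coef_, index=repeated_feature_names)
```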

Can you find the relationship between the ridge coefficients and the original coefficients?

```
# Write your code here.
```
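
One way to probe the relationship, assuming the column layout of `data_repeated` above (columns 0, 5, and 7 copy feature 0; columns 1, 6, and 8 copy feature 1), is to sum the ridge coefficients within each group of duplicates:

```
coef_ridge = pd.Series(ridge.coef_, index=repeated_feature_names)

# The L2 penalty spreads weight evenly across identical columns, so each copy
# gets roughly a third of the weight a single column would receive. Summing a
# group should therefore approximately recover the original coefficient.
print(coef_ridge.iloc[[0, 5, 7]].sum())  # compare with coef["Features 0"]
print(coef_ridge.iloc[[1, 6, 8]].sum())  # compare with coef["Features 1"]
```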