Solution for Exercise M4.02
The goal of this exercise is to build an intuition on what the parameters' values of a linear model will be when the link between the data and the target is non-linear.
First, we will generate such non-linear data.
Tip
np.random.RandomState allows one to create a random number generator which can later be used to get deterministic results.
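As a minimal illustration of this determinism (the seed 123 below is arbitrary and not part of the original notebook), two generators created with the same seed produce exactly the same draws:
import numpy as np

# Two generators seeded identically yield identical sequences
rng_a = np.random.RandomState(123)
rng_b = np.random.RandomState(123)
assert np.allclose(rng_a.rand(5), rng_b.rand(5))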
import numpy as np
# Set the seed for reproducibility
rng = np.random.RandomState(0)
# Generate data
n_sample = 100
data_max, data_min = 1.4, -1.4
len_data = data_max - data_min
data = rng.rand(n_sample) * len_data - len_data / 2
noise = rng.randn(n_sample) * 0.3
target = data**3 - 0.5 * data**2 + noise
Note
To ease the plotting, we will create a Pandas dataframe containing the data and target
import pandas as pd
full_data = pd.DataFrame({"data": data, "target": target})
import seaborn as sns
_ = sns.scatterplot(
data=full_data, x="data", y="target", color="black", alpha=0.5
)

We observe that the link between the vector data and the vector target is non-linear. For instance, data could represent the years of experience (normalized) and target the salary (normalized). Therefore, the problem here would be to infer the salary given the years of experience.
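To make the non-linearity explicit, one option is to overlay the noise-free generating curve (target = data**3 - 0.5 * data**2, as defined above) on top of the scatter plot. This is only an illustrative sketch reusing the variables defined earlier; it is not part of the original solution.
# Evaluate the noise-free cubic relationship on a fine grid
grid = np.linspace(data_min, data_max, num=300)
true_curve = grid**3 - 0.5 * grid**2

ax = sns.scatterplot(
    data=full_data, x="data", y="target", color="black", alpha=0.5
)
_ = ax.plot(grid, true_curve)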
Using the function f defined below, find both the weight and the intercept that you think will lead to a good linear model. Plot both the data and the predictions of this model.
def f(data, weight=0, intercept=0):
    # Linear model: a weighted input plus an intercept
    target_predict = weight * data + intercept
    return target_predict
# solution
predictions = f(data, weight=1.2, intercept=-0.2)
ax = sns.scatterplot(
data=full_data, x="data", y="target", color="black", alpha=0.5
)
_ = ax.plot(data, predictions)

Compute the mean squared error for this model.
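As a reminder (this formula is not spelled out in the original notebook), the mean squared error averages the squared differences between the observed targets and the predictions over the n samples:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2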
# solution
from sklearn.metrics import mean_squared_error
error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))
print(f"The MSE is {error}")
The MSE is 0.38118083900814376
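As a quick sanity check, not part of the original solution, the same value can be recovered by averaging the squared residuals directly with NumPy:
# Mean of the squared residuals, equivalent to mean_squared_error
manual_error = np.mean((target - f(data, weight=1.2, intercept=-0.2)) ** 2)
print(f"The manually computed MSE is {manual_error}")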
Train a linear regression model on this dataset.
Warning
In scikit-learn, by convention data (also called X in the scikit-learn documentation) should be a 2D matrix of shape (n_samples, n_features). If data is a 1D vector, you need to reshape it into a matrix with a single column if the vector represents a feature, or a single row if the vector represents a sample.
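A minimal sketch of the two reshaping options mentioned above (the small array below is purely illustrative and not taken from the notebook):
vector = np.array([1.0, 2.0, 3.0])
as_feature = vector.reshape(-1, 1)  # shape (3, 1): three samples, one feature
as_sample = vector.reshape(1, -1)  # shape (1, 3): one sample, three features
print(as_feature.shape, as_sample.shape)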
from sklearn.linear_model import LinearRegression
# solution
linear_regression = LinearRegression()
data_2d = data.reshape(-1, 1)
linear_regression.fit(data_2d, target)
LinearRegression()
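Since the goal of the exercise is to build intuition about the parameters' values, it can be instructive to compare the fitted parameters with the manually chosen weight=1.2 and intercept=-0.2. The coef_ and intercept_ attributes of a fitted scikit-learn LinearRegression hold these values:
print(f"Fitted weight: {linear_regression.coef_[0]}")
print(f"Fitted intercept: {linear_regression.intercept_}")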
Compute predictions from the linear regression model and plot both the data and the predictions.
# solution
predictions = linear_regression.predict(data_2d)
ax = sns.scatterplot(
data=full_data, x="data", y="target", color="black", alpha=0.5
)
_ = ax.plot(data, predictions)

Compute the mean squared error.
# solution
error = mean_squared_error(target, predictions)
print(f"The MSE is {error}")
The MSE is 0.37117544002508424