πŸ“ƒ Solution for Exercise M4.02

The goal of this exercise is to build an intuition about the parameter values of a linear model when the link between the data and the target is non-linear.

First, we will generate such non-linear data.

Tip

np.random.RandomState creates a random number generator that can later be used to obtain deterministic, reproducible results.
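For instance, here is a minimal illustration (using an arbitrary seed of 42, not the one used in the exercise): two generators built with the same seed produce exactly the same sequence of numbers.

# Illustration only: two generators seeded identically yield identical draws.
import numpy as np

rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)
print(rng_a.rand(3))
print(rng_b.rand(3))  # identical to the previous line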

import numpy as np

# Set the seed for reproducibility
rng = np.random.RandomState(0)

# Generate data
n_sample = 100
data_max, data_min = 1.4, -1.4
len_data = data_max - data_min
data = rng.rand(n_sample) * len_data - len_data / 2
noise = rng.randn(n_sample) * 0.3
target = data**3 - 0.5 * data**2 + noise

Note

To ease plotting, we create a pandas dataframe containing the data and the target.

import pandas as pd

full_data = pd.DataFrame({"data": data, "target": target})
import seaborn as sns

_ = sns.scatterplot(
    data=full_data, x="data", y="target", color="black", alpha=0.5
)
[Scatter plot of the generated data versus the target]

We observe that the link between the data vector and the target vector is non-linear. For instance, data could represent the years of experience (normalized) and target the salary (normalized). The problem is therefore to infer the salary given the years of experience.
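Because we generated the data ourselves, we can also overlay the noise-free generating curve, data**3 - 0.5 * data**2, on top of the samples. This plot is only an illustration and reuses the variables defined above.

# Illustration only: plot the noise-free generating function over the noisy samples.
grid = np.linspace(data_min, data_max, num=300)
ax = sns.scatterplot(
    data=full_data, x="data", y="target", color="black", alpha=0.5
)
_ = ax.plot(grid, grid**3 - 0.5 * grid**2)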

Using the function f defined below, find both the weight and the intercept that you think will lead to a good linear model. Plot both the data and the predictions of this model.

def f(data, weight=0, intercept=0):
    # Linear model: the prediction is a weighted sum of the data plus an intercept.
    target_predict = weight * data + intercept
    return target_predict
# solution
predictions = f(data, weight=1.2, intercept=-0.2)
ax = sns.scatterplot(
    data=full_data, x="data", y="target", color="black", alpha=0.5
)
_ = ax.plot(data, predictions)
[Scatter plot of the data with the manually chosen linear model overlaid]

Compute the mean squared error for this model.

# solution
from sklearn.metrics import mean_squared_error

error = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))
print(f"The MSE is {error}")
The MSE is 0.38118083900814376
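As a sanity check, the same value can be computed by hand as the mean of the squared residuals, reusing the predictions computed above for the manual model.

# Manual computation: the MSE is the average of the squared residuals.
manual_error = np.mean((target - predictions) ** 2)
print(f"The manually computed MSE is {manual_error}")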

Train a linear regression model on this dataset.

Warning

In scikit-learn, by convention, data (also called X in the scikit-learn documentation) should be a 2D matrix of shape (n_samples, n_features). If data is a 1D vector, you need to reshape it into a matrix with a single column if the vector represents a feature, or a single row if the vector represents a sample.
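For example, the 1D data vector can be reshaped as follows; the shapes shown in the comments follow from n_sample = 100 defined earlier.

# data is 1D with shape (n_sample,); scikit-learn expects a 2D matrix.
print(data.shape)                 # (100,)
print(data.reshape(-1, 1).shape)  # (100, 1): 100 samples of a single feature
print(data.reshape(1, -1).shape)  # (1, 100): a single sample with 100 features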

from sklearn.linear_model import LinearRegression

# solution
linear_regression = LinearRegression()
data_2d = data.reshape(-1, 1)
linear_regression.fit(data_2d, target)
LinearRegression()
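Since the exercise is about building an intuition on the parameter values, it is also informative to inspect the weight and intercept found by the model through its coef_ and intercept_ attributes.

# Parameters learned by the linear regression model.
print(f"weight: {linear_regression.coef_[0]:.3f}")
print(f"intercept: {linear_regression.intercept_:.3f}")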

Compute predictions from the linear regression model and plot both the data and the predictions.

# solution
predictions = linear_regression.predict(data_2d)
ax = sns.scatterplot(
    data=full_data, x="data", y="target", color="black", alpha=0.5
)
_ = ax.plot(data, predictions)
[Scatter plot of the data with the fitted linear regression predictions overlaid]

Compute the mean squared error of this model.

# solution
error = mean_squared_error(target, predictions)
print(f"The MSE is {error}")
The MSE is 0.37117544002508424
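To close the loop, we can compare the error of the manually chosen parameters with the error of the fitted model; the fitted model should do at least as well on this data, since ordinary least squares minimizes the MSE on the training set.

# Compare the MSE of the manual model with the MSE of the fitted model.
manual_mse = mean_squared_error(target, f(data, weight=1.2, intercept=-0.2))
fitted_mse = mean_squared_error(target, linear_regression.predict(data_2d))
print(f"Manual model MSE: {manual_mse:.4f}")
print(f"Fitted model MSE: {fitted_mse:.4f}")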