📃 Solution for Exercise M4.01

The aim of this exercise is two-fold:

  • understand the parametrization of a linear model;

  • quantify the fitting accuracy of a set of such models.

We will reuse part of the code of the course to:

  • load data;

  • create the function representing a linear model.

Prerequisites

Data loading

Note

For a more detailed overview of this dataset, refer to the "Appendix - Datasets description" section at the end of this MOOC.

import pandas as pd

penguins = pd.read_csv("../datasets/penguins_regression.csv")
feature_name = "Flipper Length (mm)"
target_name = "Body Mass (g)"
data, target = penguins[[feature_name]], penguins[target_name]

Model definition

def linear_model_flipper_mass(
    flipper_length, weight_flipper_length, intercept_body_mass
):
    """Linear model of the form y = a * x + b"""
    body_mass = weight_flipper_length * flipper_length + intercept_body_mass
    return body_mass
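
As a quick sanity check of this parametrization, the sketch below evaluates the model for a single flipper length. The function is repeated so that the snippet runs on its own. The 200 mm flipper length and the parameter values (45 g/mm and -5000 g) are illustrative assumptions, not fitted values.

```python
def linear_model_flipper_mass(
    flipper_length, weight_flipper_length, intercept_body_mass
):
    """Linear model of the form y = a * x + b"""
    return weight_flipper_length * flipper_length + intercept_body_mass


# Illustrative values only: a 200 mm flipper, a slope of 45 g/mm and an
# intercept of -5000 g.
example_mass = linear_model_flipper_mass(200, 45, -5000)
print(example_mass)  # 45 * 200 - 5000 = 4000

# With a weight of zero, the prediction no longer depends on the flipper
# length and is equal to the intercept.
constant_mass = linear_model_flipper_mass(200, 0, 4200)
print(constant_mass)
```

The weight controls how much the predicted body mass changes per additional millimeter of flipper length, while the intercept shifts all predictions by a constant amount.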

Main exercise

Define a vector weights = [...] and a vector intercepts = [...] of the same length. Each pair of entries (weights[i], intercepts[i]) defines a different model. Use these vectors along with the vector flipper_length_range to plot several linear models that could possibly fit our data. Use the helper function defined above to visualize both the models and the real samples.

import numpy as np

flipper_length_range = np.linspace(data.min(), data.max(), num=300)
# solution
import matplotlib.pyplot as plt
import seaborn as sns

weights = [-40, 45, 90]
intercepts = [15000, -5000, -14000]

ax = sns.scatterplot(
    data=penguins, x=feature_name, y=target_name, color="black", alpha=0.5
)

label = "{0:.2f} (g / mm) * flipper length + {1:.2f} (g)"
for weight, intercept in zip(weights, intercepts):
    predicted_body_mass = linear_model_flipper_mass(
        flipper_length_range, weight, intercept
    )

    ax.plot(
        flipper_length_range,
        predicted_body_mass,
        label=label.format(weight, intercept),
    )
_ = ax.legend(loc="center left", bbox_to_anchor=(-0.25, 1.25), ncol=1)
[Figure: scatter plot of body mass (g) against flipper length (mm), with the three candidate linear models drawn as lines.]

In the previous question, you were asked to create several linear models. The visualization allowed you to qualitatively assess whether one model was better than another.

Now, you should come up with a quantitative measure that indicates the goodness of fit of each linear model and allows you to select the best model. Define a function goodness_fit_measure(true_values, predictions) that takes as inputs the true target values and the predictions and returns a single scalar as output.

# solution
def goodness_fit_measure(true_values, predictions):
    # Compute the error between the true values and the predictions of our
    # model.
    errors = np.ravel(true_values) - np.ravel(predictions)
    # There are several possible strategies to reduce all errors to a single
    # value. Computing the mean error (sum divided by the number of elements)
    # might seem like a good solution. However, positive and negative errors
    # cancel each other out, which misleadingly reduces the mean error.
    # Therefore, we can either square each error or take its absolute value:
    # these metrics are known as the mean squared error (MSE) and the mean
    # absolute error (MAE). Let's use the MAE here as an example.
    return np.mean(np.abs(errors))
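
To make the cancellation argument concrete, here is a minimal sketch comparing the mean error, the MAE and the MSE. The three true values and predictions are made up for illustration and are not taken from the penguins dataset.

```python
import numpy as np

# Toy values chosen for illustration only.
true_values = np.array([3000.0, 4000.0, 5000.0])
predictions = np.array([3500.0, 4000.0, 4500.0])

# One prediction is 500 g too high, one is exact, one is 500 g too low.
errors = true_values - predictions

# The signed errors cancel out, so the mean error is zero even though two
# of the three predictions are wrong.
mean_error = np.mean(errors)

# Taking the absolute value or squaring removes the sign before averaging.
mae = np.mean(np.abs(errors))
mse = np.mean(errors**2)

print(f"Mean error: {mean_error:.2f}")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
```

The MAE is expressed in the same unit as the target (grams here), whereas the MSE is expressed in squared units and penalizes large errors more heavily.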

You can now copy and paste the code below to show the goodness of fit for each model.

for model_idx, (weight, intercept) in enumerate(zip(weights, intercepts)):
    target_predicted = linear_model_flipper_mass(data, weight, intercept)
    print(f"Model #{model_idx}:")
    print(f"{weight:.2f} (g / mm) * flipper length + {intercept:.2f} (g)")
    print(f"Error: {goodness_fit_measure(target, target_predicted):.3f}\n")
Model #0:
-40.00 (g / mm) * flipper length + 15000.00 (g)
Error: 2764.854

Model #1:
45.00 (g / mm) * flipper length + -5000.00 (g)
Error: 338.523

Model #2:
90.00 (g / mm) * flipper length + -14000.00 (g)
Error: 573.041
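
Since the measure returns a single scalar per model, selecting the best candidate amounts to picking the smallest error. The sketch below does this with the error values copied from the output above. The variable names errors and best_idx are introduced here for illustration only.

```python
# Candidate parameters and the errors reported above, in the same order.
weights = [-40, 45, 90]
intercepts = [15000, -5000, -14000]
errors = [2764.854, 338.523, 573.041]

# Index of the model with the smallest mean absolute error.
best_idx = min(range(len(errors)), key=lambda idx: errors[idx])

print(f"Best model: #{best_idx}")
print(
    f"{weights[best_idx]:.2f} (g / mm) * flipper length + "
    f"{intercepts[best_idx]:.2f} (g)"
)
```

Among these three candidates, model #1 has the lowest error: its predictions are off by about 339 g on average. This only ranks the hand-picked candidates. It does not guarantee that no better pair of parameters exists.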