# ---
# jupyter:
#   kernelspec:
#     display_name: Python 3
#     name: python3
# ---
# %% [markdown]
# # Linear regression without scikit-learn
#
# In this notebook, we introduce linear regression. Before presenting the
# available scikit-learn classes, we will provide some insights with a simple
# example. We will use a dataset that contains measurements taken on penguins.
# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```
# %%
import pandas as pd
penguins = pd.read_csv("../datasets/penguins_regression.csv")
penguins.head()
# %% [markdown]
# We will formulate the following problem: using the flipper length of a
# penguin, we would like to infer its mass.
# %%
import seaborn as sns
feature_name = "Flipper Length (mm)"
target_name = "Body Mass (g)"
data, target = penguins[[feature_name]], penguins[target_name]
ax = sns.scatterplot(
    data=penguins, x=feature_name, y=target_name, color="black", alpha=0.5
)
_ = ax.set_title("Body Mass as a function of the Flipper Length")
# %% [markdown]
# ```{tip}
# The function `scatterplot` from seaborn takes the full dataframe as input,
# and the parameters `x` and `y` specify the names of the columns to plot.
# Note that this function returns a matplotlib axis (named `ax` in the example
# above) that can be used to add further elements to the same plot (such as a
# title).
# ```
# %% [markdown]
# In this problem, penguin mass is our target. It is a continuous variable that
# roughly varies between 2700 g and 6300 g. Thus, this is a regression problem
# (in contrast to classification). We also see that there is almost a linear
# relationship between the body mass of the penguin and its flipper length. The
# longer the flipper, the heavier the penguin.
#
# Thus, we could come up with a simple formula, where given a flipper length we
# could compute the body mass of a penguin using a linear relationship of the
# form `y = a * x + b` where `a` and `b` are the 2 parameters of our model.
# %%
def linear_model_flipper_mass(
    flipper_length, weight_flipper_length, intercept_body_mass
):
    """Linear model of the form y = a * x + b"""
    body_mass = weight_flipper_length * flipper_length + intercept_body_mass
    return body_mass
# %% [markdown]
# Using the model we defined above, we can check the body mass values predicted
# for a range of flipper lengths. We will set `weight_flipper_length` to be 45
# and `intercept_body_mass` to be -5000.
# %%
import numpy as np
weight_flipper_length = 45
intercept_body_mass = -5000
flipper_length_range = np.linspace(data.min(), data.max(), num=300)
predicted_body_mass = linear_model_flipper_mass(
    flipper_length_range, weight_flipper_length, intercept_body_mass
)
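# %% [markdown]
# As a quick sanity check, we can evaluate the same formula by hand for a few
# flipper lengths. This is only a minimal sketch: the three lengths below and
# the names `example_flipper_lengths` and `example_body_masses` are
# illustrative assumptions, not values taken from the dataset.
# %%
import numpy as np
# Hypothetical flipper lengths (mm), chosen only for illustration.
example_flipper_lengths = np.array([170, 200, 230])
# Same parameters as above: a = 45 g/mm and b = -5000 g.
example_body_masses = 45 * example_flipper_lengths - 5000
print(example_body_masses)
# %% [markdown]
# The resulting masses (2650 g, 4000 g and 5350 g) have the same order of
# magnitude as the body masses observed in the data.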
# %% [markdown]
# We can now plot all samples and the linear model prediction.
# %%
label = "{0:.2f} (g / mm) * flipper length + {1:.2f} (g)"
ax = sns.scatterplot(
    data=penguins, x=feature_name, y=target_name, color="black", alpha=0.5
)
ax.plot(flipper_length_range, predicted_body_mass)
_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))
# %% [markdown]
# The variable `weight_flipper_length` is a weight applied to the feature
# `flipper_length` in order to make the inference. When this coefficient is
# positive, the model predicts larger body masses for penguins with longer
# flippers. When it is negative, the model predicts larger body masses for
# penguins with shorter flippers. Graphically, this coefficient is the slope
# of the line in the plot. Below we show what the line looks like when the
# `weight_flipper_length` coefficient is negative.
# %%
weight_flipper_length = -40
intercept_body_mass = 13000
predicted_body_mass = linear_model_flipper_mass(
    flipper_length_range, weight_flipper_length, intercept_body_mass
)
# %% [markdown]
# We can again plot all samples, this time with the prediction of the model
# that has a negative weight.
# %%
ax = sns.scatterplot(
    data=penguins, x=feature_name, y=target_name, color="black", alpha=0.5
)
ax.plot(flipper_length_range, predicted_body_mass)
_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))
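# %% [markdown]
# The effect of a negative weight can also be checked numerically. The sketch
# below assumes three illustrative flipper lengths (not taken from the dataset)
# and the hypothetical names `example_lengths_mm` and `decreasing_body_masses`.
# It applies the weight of -40 g/mm and the intercept of 13000 g used above.
# %%
import numpy as np
# Hypothetical flipper lengths (mm), sorted in increasing order.
example_lengths_mm = np.array([170, 200, 230])
# y = a * x + b with a = -40 g/mm and b = 13000 g.
decreasing_body_masses = -40 * example_lengths_mm + 13000
print(decreasing_body_masses)
# Each prediction is smaller than the previous one when the weight is negative.
print(np.all(np.diff(decreasing_body_masses) < 0))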
# %% [markdown]
# In our case, this coefficient has a meaningful unit: g/mm. For instance, a
# coefficient of 40 g/mm means that for each additional millimeter in flipper
# length, the predicted body mass increases by 40 g.
# %%
body_mass_180 = linear_model_flipper_mass(
    flipper_length=180, weight_flipper_length=40, intercept_body_mass=0
)
body_mass_181 = linear_model_flipper_mass(
    flipper_length=181, weight_flipper_length=40, intercept_body_mass=0
)
print(
    "The body mass for a flipper length of 180 mm "
    f"is {body_mass_180} g and {body_mass_181} g "
    "for a flipper length of 181 mm"
)
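# %% [markdown]
# The increase per millimeter depends only on the weight, not on the intercept.
# As a minimal sketch, we repeat the computation with two intercepts. Both
# intercept values and the names `example_intercept` and `mass_differences` are
# assumptions made for illustration.
# %%
mass_differences = []
# Two arbitrary intercepts (g), used only to show that they cancel out.
for example_intercept in (0, -5000):
    mass_at_180 = 40 * 180 + example_intercept
    mass_at_181 = 40 * 181 + example_intercept
    mass_differences.append(mass_at_181 - mass_at_180)
    print(
        f"intercept = {example_intercept} g: "
        f"difference = {mass_at_181 - mass_at_180} g"
    )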
# %% [markdown]
# We can also see that we have a parameter `intercept_body_mass` in our model.
# This parameter corresponds to the value on the y-axis if `flipper_length=0`
# (which in our case is only a mathematical consideration, as in our data, the
# value of `flipper_length` only goes from 170 mm to 230 mm). This y-value when
# x=0 is called the y-intercept. If `intercept_body_mass` is 0, the line will
# pass through the origin:
# %%
weight_flipper_length = 25
intercept_body_mass = 0
# redefine the flipper length range to start at 0 so the intercept is visible
flipper_length_range = np.linspace(0, data.max(), num=300)
predicted_body_mass = linear_model_flipper_mass(
    flipper_length_range, weight_flipper_length, intercept_body_mass
)
# %%
ax = sns.scatterplot(
    data=penguins, x=feature_name, y=target_name, color="black", alpha=0.5
)
ax.plot(flipper_length_range, predicted_body_mass)
_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))
# %% [markdown]
# Otherwise, the line crosses the y-axis at the `intercept_body_mass` value:
# %%
weight_flipper_length = 45
intercept_body_mass = -5000
predicted_body_mass = linear_model_flipper_mass(
    flipper_length_range, weight_flipper_length, intercept_body_mass
)
# %%
ax = sns.scatterplot(
    data=penguins, x=feature_name, y=target_name, color="black", alpha=0.5
)
ax.plot(flipper_length_range, predicted_body_mass)
_ = ax.set_title(label.format(weight_flipper_length, intercept_body_mass))
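# %% [markdown]
# A negative intercept may look surprising, since a body mass cannot be
# negative. The sketch below reuses the values 45 g/mm and -5000 g under
# hypothetical variable names. It computes the prediction at a flipper length
# of 0 mm, and the flipper length `-b / a` at which the line reaches a body
# mass of 0 g.
# %%
example_weight = 45  # g / mm, same value as above
example_intercept_g = -5000  # g, same value as above
# At x = 0 the weight term vanishes, so the prediction equals the intercept.
mass_at_zero_length = example_weight * 0 + example_intercept_g
# Solving a * x + b = 0 for x gives x = -b / a.
zero_mass_flipper_length = -example_intercept_g / example_weight
print(f"Predicted mass at 0 mm: {mass_at_zero_length} g")
print(f"Predicted mass is 0 g at {zero_mass_flipper_length:.1f} mm")
# %% [markdown]
# The zero crossing sits near 111 mm, well below the 170 mm to 230 mm range
# mentioned earlier, so the negative intercept does not lead to negative
# predictions for flipper lengths in that range.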
# %% [markdown]
# In this notebook, we have seen the parametrization of a linear regression
# model and, more precisely, the meaning of the terms weight and intercept.