📝 Exercise M4.02
In the previous notebook, we showed that we can add new features based on the original feature `x` to make the model more expressive, for instance `x ** 2` or `x ** 3`. In that case we only used a single feature in `data`.
The aim of this notebook is to train a linear regression algorithm on a dataset with more than a single feature. In such a "multi-dimensional" feature space we can derive new features of the form `x1 * x2`, `x2 * x3`, etc. Products of features are usually called "non-linear" or "multiplicative" interactions between features.
Feature engineering can be an important step of a model pipeline as long as the new features are expected to be predictive. For instance, think of a classification model to decide if a patient is at risk of developing a heart disease. This risk would depend on the patient's Body Mass Index, which is defined as `weight / height ** 2`.
We load the penguins dataset. We first use a set of 3 numerical features to predict the target, i.e. the body mass of the penguin.
Note: If you want a deeper overview regarding this dataset, you can refer to the "Appendix - Datasets description" section at the end of this MOOC.
```python
import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")

columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"

# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data
```
|     | Flipper Length (mm) | Culmen Length (mm) | Culmen Depth (mm) |
| --- | --- | --- | --- |
| 0   | 181.0 | 39.1 | 18.7 |
| 1   | 186.0 | 39.5 | 17.4 |
| 2   | 195.0 | 40.3 | 18.0 |
| 4   | 193.0 | 36.7 | 19.3 |
| 5   | 190.0 | 39.3 | 20.6 |
| ... | ...   | ...  | ...  |
| 339 | 207.0 | 55.8 | 19.8 |
| 340 | 202.0 | 43.5 | 18.1 |
| 341 | 193.0 | 49.6 | 18.2 |
| 342 | 210.0 | 50.8 | 19.0 |
| 343 | 198.0 | 50.2 | 18.7 |

342 rows × 3 columns
Now it is your turn to train a linear regression model on this dataset. First, create a linear regression model.
```python
# Write your code here.
```
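For instance, a minimal sketch using scikit-learn's `LinearRegression` (the variable name `linear_regression` is our own choice):

```python
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
```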
Execute a cross-validation with 10 folds and use the mean absolute error (MAE) as the metric.
```python
# Write your code here.
```
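One possible sketch, reusing the `linear_regression` model from the previous cell:

```python
from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    linear_regression,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)
```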
Compute the mean and std of the MAE in grams (g). Remember you have to revert the sign introduced when metrics start with `neg_`, such as in `"neg_mean_absolute_error"`.
```python
# Write your code here.
```
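One way to do it, assuming the `cv_results` dictionary from the previous sketch:

```python
# The "test_score" entries are negated MAE values: flip the sign to get
# errors expressed in grams.
errors = -cv_results["test_score"]
print(f"MAE: {errors.mean():.1f} g ± {errors.std():.1f} g")
```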
Now create a pipeline using `make_pipeline` consisting of a `PolynomialFeatures` and a linear regression. Set `degree=2` and `interaction_only=True` in the feature engineering step. Remember not to include a "bias" feature (that is, a constant-valued feature) to avoid introducing a redundancy with the intercept of the subsequent linear regression model.

You may want to use the `.set_output(transform="pandas")` method of the pipeline to answer the next question.
```python
# Write your code here.
```
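A possible sketch (the name `poly_pipeline` is our own):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_pipeline = make_pipeline(
    # include_bias=False avoids the redundant constant feature
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).set_output(transform="pandas")
```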
Transform the first 5 rows of the dataset and look at the column names. How many features are generated at the output of the `PolynomialFeatures` step in the previous pipeline?
```python
# Write your code here.
```
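One possible approach, assuming the `poly_pipeline` sketch above; indexing the pipeline with `[0]` returns its first step, the `PolynomialFeatures` transformer:

```python
# Fit the full pipeline so the feature engineering step is fitted too,
# then transform the first 5 rows only.
poly_pipeline.fit(data, target)
poly_pipeline[0].transform(data[:5])
```

With 3 input features, `interaction_only=True` keeps the 3 original columns and adds the 3 pairwise products (the squared terms are dropped), so 6 columns should come out.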
Check that the values of a few of the new interaction features are correct.
```python
# Write your code here.
```
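For instance, one spot-check, assuming the pandas output and the space-joined column names produced by `get_feature_names_out`:

```python
transformed = poly_pipeline[0].transform(data[:5])
manual_product = (
    data["Flipper Length (mm)"][:5] * data["Culmen Length (mm)"][:5]
)
# Both columns should match value by value.
print(transformed["Flipper Length (mm) Culmen Length (mm)"].to_numpy())
print(manual_product.to_numpy())
```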
Use the same cross-validation strategy as done previously to estimate the mean and std of the MAE in grams (g) for such a pipeline. Compare with the results without feature engineering.
```python
# Write your code here.
```
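A sketch mirroring the earlier cross-validation, with the pipeline in place of the bare model:

```python
cv_results_poly = cross_validate(
    poly_pipeline,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)
poly_errors = -cv_results_poly["test_score"]
print(f"MAE: {poly_errors.mean():.1f} g ± {poly_errors.std():.1f} g")
```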
Now let's try to build an alternative pipeline with an adjustable number of intermediate features while keeping a similar predictive power. To do so, try using the `Nystroem` transformer instead of `PolynomialFeatures`. Set the kernel parameter to `"poly"` and `degree` to 2. Adjust the number of components to be as small as possible while keeping a good cross-validation performance.

Hint: Use a `ValidationCurveDisplay` with `param_range = np.array([5, 10, 50, 100])` to find the optimal `n_components`.
```python
# Write your code here.
```
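A possible sketch (`nystroem_pipeline` is our own name; `negate_score=True` makes the display plot the MAE itself rather than its negated value):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ValidationCurveDisplay
from sklearn.pipeline import make_pipeline

nystroem_pipeline = make_pipeline(
    Nystroem(kernel="poly", degree=2),
    LinearRegression(),
)

param_range = np.array([5, 10, 50, 100])
ValidationCurveDisplay.from_estimator(
    nystroem_pipeline,
    data,
    target,
    param_name="nystroem__n_components",
    param_range=param_range,
    cv=10,
    scoring="neg_mean_absolute_error",
    negate_score=True,
)
```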
How do the mean and std of the MAE for the Nystroem pipeline with the optimal `n_components` compare to the previous models?
```python
# Write your code here.
```
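A sketch, where `n_components=10` is a hypothetical placeholder to be replaced by the value you read off the validation curve:

```python
# Hypothetical choice: substitute the n_components value you selected.
nystroem_pipeline.set_params(nystroem__n_components=10)
cv_results_nystroem = cross_validate(
    nystroem_pipeline,
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)
nystroem_errors = -cv_results_nystroem["test_score"]
print(f"MAE: {nystroem_errors.mean():.1f} g ± {nystroem_errors.std():.1f} g")
```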