# 📝 Exercise M4.02

In the previous notebook, we showed that we can add new features based on the
original feature `x` to make the model more expressive, for instance `x ** 2`
or `x ** 3`. In that case we only used a single feature in `data`.

The aim of this notebook is to train a linear regression algorithm on a
dataset with more than a single feature. In such a "multi-dimensional" feature
space we can derive new features of the form `x1 * x2`, `x2 * x3`, etc.
Products of features are usually called "non-linear" or "multiplicative"
interactions between features.

Feature engineering can be an important step of a model pipeline as long as
the new features are expected to be predictive. For instance, think of a
classification model to decide if a patient has risk of developing a heart
disease. This would depend on the patient's Body Mass Index which is defined
as `weight / height ** 2`.

We load the penguins dataset. We first use a set of 3 numerical features to predict the target, i.e. the body mass of the penguin.

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

```
import pandas as pd
penguins = pd.read_csv("../datasets/penguins.csv")
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()
data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
```


| | Flipper Length (mm) | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|---|
| 0 | 181.0 | 39.1 | 18.7 |
| 1 | 186.0 | 39.5 | 17.4 |
| 2 | 195.0 | 40.3 | 18.0 |
| 4 | 193.0 | 36.7 | 19.3 |
| 5 | 190.0 | 39.3 | 20.6 |

Now it is your turn to train a linear regression model on this dataset. First, create a linear regression model.

```
# Write your code here.
```
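One possible sketch (not the official solution): scikit-learn's
`LinearRegression` is the ordinary least-squares model asked for here.

```python
from sklearn.linear_model import LinearRegression

# An ordinary least-squares linear regression model (not fitted yet).
linear_regression = LinearRegression()
```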

Execute a cross-validation with 10 folds and use the mean absolute error (MAE) as metric.

```
# Write your code here.
```
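A possible approach using `cross_validate`. To keep the snippet runnable on
its own, a small synthetic dataset stands in for the notebook's penguins
`data` / `target` (the column names `f1`..`f3` are placeholders).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the notebook's `data` / `target`.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
target = 2 * data["f1"] + rng.normal(size=100)

# 10-fold cross-validation scored with the (negated) mean absolute error.
cv_results = cross_validate(
    LinearRegression(),
    data,
    target,
    cv=10,
    scoring="neg_mean_absolute_error",
)
```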

Compute the mean and std of the MAE in grams (g). Remember you have to revert
the sign introduced when metrics start with `neg_`, such as in
`"neg_mean_absolute_error"`.

```
# Write your code here.
```
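The sign flip can be sketched as below (again with a synthetic stand-in for
the penguins `data` / `target`, so the cell is self-contained):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the notebook's `data` / `target`.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
target = 2 * data["f1"] + rng.normal(size=100)

cv_results = cross_validate(
    LinearRegression(), data, target, cv=10, scoring="neg_mean_absolute_error"
)

# Flip the sign of the `neg_mean_absolute_error` scores to get errors.
errors = -cv_results["test_score"]
print(f"MAE: {errors.mean():.3f} ± {errors.std():.3f}")
```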

Now create a pipeline using `make_pipeline` consisting of a
`PolynomialFeatures` and a linear regression. Set `degree=2` and
`interaction_only=True` to the feature engineering step. Remember not to
include a "bias" feature (that is a constant-valued feature) to avoid
introducing a redundancy with the intercept of the subsequent linear
regression model.

You may want to use the `.set_output(transform="pandas")` method of the
pipeline to answer the next question.

```
# Write your code here.
```

Transform the first 5 rows of the dataset and look at the column names. How
many features are generated at the output of the `PolynomialFeatures` step in
the previous pipeline?

```
# Write your code here.
```

Check that the values for the new interaction features are correct for a few of them.

```
# Write your code here.
```

Use the same cross-validation strategy as done previously to estimate the mean and std of the MAE in grams (g) for such a pipeline. Compare with the results without feature engineering.

```
# Write your code here.
```
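A sketch of the comparison, on synthetic data whose target contains a genuine
interaction term (so the engineered features have something to capture; the
real exercise uses the penguins `data` / `target` instead):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic target driven by an interaction between two features.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
target = data["f1"] * data["f2"] + rng.normal(scale=0.1, size=100)

poly_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
cv_results = cross_validate(
    poly_pipeline, data, target, cv=10, scoring="neg_mean_absolute_error"
)
errors = -cv_results["test_score"]
print(f"MAE with interactions: {errors.mean():.3f} ± {errors.std():.3f}")
```

On such data the interaction-augmented model fits far better than a plain
linear regression would, since the true relationship is multiplicative.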

Now let's try to build an alternative pipeline with an adjustable number of
intermediate features while keeping a similar predictive power. To do so, try
using the `Nystroem` transformer instead of `PolynomialFeatures`. Set the
kernel parameter to `"poly"` and `degree` to 2. Adjust the number of
components to be as small as possible while keeping a good cross-validation
performance.

Hint: Use a `ValidationCurveDisplay` with
`param_range = np.array([5, 10, 50, 100])` to find the optimal
`n_components`.

```
# Write your code here.
```

How do the mean and std of the MAE for the Nystroem pipeline with optimal
`n_components` compare to the other previous models?

```
# Write your code here.
```
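A sketch of the final comparison. Here `n_components=10` is a hypothetical
"optimal" value; in the exercise you would read it off the validation curve
(and the numbers come from synthetic data, not the penguins).

```python
import numpy as np
import pandas as pd
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the notebook's `data` / `target`.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(150, 3)), columns=["f1", "f2", "f3"])
target = data["f1"] * data["f2"] + rng.normal(scale=0.1, size=150)

# n_components=10 is a placeholder for the value found on the curve.
nystroem_pipeline = make_pipeline(
    Nystroem(kernel="poly", degree=2, n_components=10),
    LinearRegression(),
)
cv_results = cross_validate(
    nystroem_pipeline, data, target, cv=10, scoring="neg_mean_absolute_error"
)
errors = -cv_results["test_score"]
print(f"Nystroem MAE: {errors.mean():.3f} ± {errors.std():.3f}")
```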