# 📝 Exercise M4.02

In the previous notebook, we showed that we can add new features based on the
original feature `x` to make the model more expressive, for instance `x ** 2`
or `x ** 3`. In that case we only used a single feature in `data`.

The aim of this notebook is to train a linear regression algorithm on a
dataset with more than a single feature. In such a "multi-dimensional" feature
space we can derive new features of the form `x1 * x2`, `x2 * x3`, etc.
Products of features are usually called "non-linear" or "multiplicative"
interactions between features.
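As a minimal sketch of what such interaction features look like, the snippet below builds `x1 * x2` and `x2 * x3` by hand with NumPy. The array `X` and its values are made up for illustration and are not taken from the dataset used in this exercise.

```python
import numpy as np

# Toy data with three hypothetical features x1, x2, x3 (one row per sample).
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])
x1, x2, x3 = X.T

# Multiplicative interactions are element-wise products of feature columns.
x1_x2 = x1 * x2
x2_x3 = x2 * x3

# Append the new columns to the right of the original features.
X_augmented = np.column_stack([X, x1_x2, x2_x3])
print(X_augmented)
```

Each new column is a non-linear function of the original features, yet a linear model fitted on `X_augmented` remains linear in its inputs.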

Feature engineering can be an important step of a model pipeline as long as
the new features are expected to be predictive. For instance, think of a
classification model to decide if a patient is at risk of developing heart
disease. This would depend on the patient's Body Mass Index, which is defined
as `weight / height ** 2`.
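The Body Mass Index formula above can be sketched as a derived column in pandas. The column names `weight_kg` and `height_m` and the values are hypothetical, chosen only to illustrate the computation.

```python
import pandas as pd

# Hypothetical patient records; column names and values are made up.
patients = pd.DataFrame(
    {
        "weight_kg": [70.0, 90.0],
        "height_m": [1.75, 1.80],
    }
)

# Engineered feature: weight divided by squared height.
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2
print(patients)
```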

We load the penguins dataset. We first use a set of 3 numerical features to predict the target, i.e. the body mass of the penguin.

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

```
import pandas as pd
penguins = pd.read_csv("../datasets/penguins.csv")
columns = ["Flipper Length (mm)", "Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Body Mass (g)"
# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()
data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data
```

| | Flipper Length (mm) | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|---|
| 0 | 181.0 | 39.1 | 18.7 |
| 1 | 186.0 | 39.5 | 17.4 |
| 2 | 195.0 | 40.3 | 18.0 |
| 4 | 193.0 | 36.7 | 19.3 |
| 5 | 190.0 | 39.3 | 20.6 |
| ... | ... | ... | ... |
| 339 | 207.0 | 55.8 | 19.8 |
| 340 | 202.0 | 43.5 | 18.1 |
| 341 | 193.0 | 49.6 | 18.2 |
| 342 | 210.0 | 50.8 | 19.0 |
| 343 | 198.0 | 50.2 | 18.7 |

342 rows × 3 columns

Now it is your turn to train a linear regression model on this dataset. First, create a linear regression model.

```
# Write your code here.
```

Execute a cross-validation with 10 folds and use the mean absolute error (MAE) as metric.

```
# Write your code here.
```

Compute the mean and std of the MAE in grams (g). Remember you have to revert
the sign introduced when metrics start with `neg_`, such as in
`"neg_mean_absolute_error"`.

```
# Write your code here.
```
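To illustrate the sign convention, here is a sketch on synthetic data (an assumption made so the snippet is self-contained; it does not use the penguins data). It runs a 10-fold cross-validation with `"neg_mean_absolute_error"` and negates the scores to recover the MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (an assumption, not the penguins dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv_results = cross_validate(
    LinearRegression(), X, y, cv=10, scoring="neg_mean_absolute_error"
)

# Scores are negated errors (higher is better), so flip the sign to get the MAE.
mae = -cv_results["test_score"]
print(f"MAE: {mae.mean():.3f} ± {mae.std():.3f}")
```

The `neg_` prefix exists because scikit-learn treats every score as "higher is better", so error metrics are reported with a negative sign.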

Now create a pipeline using `make_pipeline` consisting of a
`PolynomialFeatures` and a linear regression. Set `degree=2` and
`interaction_only=True` in the feature engineering step. Remember not to
include a "bias" feature (that is, a constant-valued feature) to avoid
introducing a redundancy with the intercept of the subsequent linear
regression model.

You may want to use the `.set_output(transform="pandas")` method of the
pipeline to answer the next question.

```
# Write your code here.
```

Transform the first 5 rows of the dataset and look at the column names. How
many features are generated at the output of the `PolynomialFeatures` step in
the previous pipeline?

```
# Write your code here.
```

Check that the values for the new interaction features are correct for a few of them.

```
# Write your code here.
```
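One possible way to run this kind of check is to recompute a product by hand and compare it with the transformer output. The sketch below does so on a made-up two-feature array, not on the penguins data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with two features.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)

# With two inputs, the output columns are: first feature, second feature,
# then their product.
manual_product = X[:, 0] * X[:, 1]
values_match = np.allclose(X_interactions[:, 2], manual_product)
print(values_match)
```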

Use the same cross-validation strategy as done previously to estimate the mean and std of the MAE in grams (g) for such a pipeline. Compare with the results without feature engineering.

```
# Write your code here.
```

Now let's try to build an alternative pipeline with an adjustable number of
intermediate features while keeping a similar predictive power. To do so, try
using the `Nystroem` transformer instead of `PolynomialFeatures`. Set the
kernel parameter to `"poly"` and `degree` to 2. Adjust the number of
components to be as small as possible while keeping a good cross-validation
performance.

Hint: Use a `ValidationCurveDisplay` with
`param_range = np.array([5, 10, 50, 100])` to find the optimal `n_components`.

```
# Write your code here.
```
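As a sketch of the kind of sweep the hint describes, the snippet below uses the `validation_curve` function instead of `ValidationCurveDisplay`, so that it runs without plotting. The synthetic data, `cv=5` and `random_state=0` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline

# Synthetic data with one multiplicative interaction (not the penguins data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=200)

model = make_pipeline(
    Nystroem(kernel="poly", degree=2, random_state=0),
    LinearRegression(),
)

param_range = np.array([5, 10, 50, 100])
train_scores, test_scores = validation_curve(
    model,
    X,
    y,
    # make_pipeline names each step after its lowercased class name.
    param_name="nystroem__n_components",
    param_range=param_range,
    cv=5,
    scoring="neg_mean_absolute_error",
)

# One row per n_components value, one column per fold; negate to get the MAE.
test_mae = -test_scores
for n, fold_mae in zip(param_range, test_mae):
    print(f"n_components={n}: MAE = {fold_mae.mean():.3f} ± {fold_mae.std():.3f}")
```

Comparing the rows of `test_mae` shows how the test error changes with `n_components`, which helps pick the smallest value that still performs well.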

How do the mean and std of the MAE for the Nystroem pipeline with optimal
`n_components` compare to the other previous models?

```
# Write your code here.
```