📃 Solution for Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the generalization performance, as estimated by cross-validation on the training set.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

In this exercise, we will progressively define the regression pipeline and later tune its hyperparameters.

Start by defining a pipeline that:

  • uses a StandardScaler to normalize the numerical data;

  • uses a sklearn.neighbors.KNeighborsRegressor as a predictive model.

# solution
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

scaler = StandardScaler()
model = make_pipeline(scaler, KNeighborsRegressor())
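As a quick check before tuning, the sketch below lists the parameter names that the pipeline exposes. It rebuilds the same pipeline under the name `pipeline` (chosen here only so the snippet runs on its own) and relies on `make_pipeline` naming each step after its lowercased class name.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same pipeline as above, rebuilt here so the snippet is self-contained.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Nested parameters follow the "<step name>__<parameter name>" convention.
nested_params = sorted(name for name in pipeline.get_params() if "__" in name)
for name in nested_params:
    print(name)
```

Names such as `kneighborsregressor__n_neighbors` or `standardscaler__with_mean` in this listing are the keys that a search object expects when tuning a pipeline.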

Use RandomizedSearchCV with n_iter=20 to find the best set of hyperparameters by tuning the following parameters of the model:

  • the parameter n_neighbors of the KNeighborsRegressor with values np.logspace(0, 3, num=10).astype(np.int32);

  • the parameter with_mean of the StandardScaler with possible values True or False;

  • the parameter with_std of the StandardScaler with possible values True or False.
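To make the effect of the two StandardScaler flags concrete, here is a minimal sketch on a toy single-feature array. The data are made up for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, made up for illustration only.
X = np.array([[1.0], [2.0], [3.0]])

# with_mean=False, with_std=True: divide by the standard deviation only.
scaled_only = StandardScaler(with_mean=False, with_std=True).fit_transform(X)

# with_mean=True, with_std=False: subtract the mean only.
centered_only = StandardScaler(with_mean=True, with_std=False).fit_transform(X)

print(scaled_only.ravel())    # unit variance, but the mean is not zero
print(centered_only.ravel())  # zero mean, original spread kept
```

Setting both flags to False leaves the data unchanged, while setting both to True gives the usual standardization.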

Notice that in the notebook "Hyperparameter tuning by randomized-search" we pass distributions to be sampled by the RandomizedSearchCV. In this case we define a fixed grid of hyperparameters to be explored. Using a GridSearchCV instead would explore all the possible combinations on the grid, which can be costly to compute for large grids, whereas the parameter n_iter of the RandomizedSearchCV controls the number of different random combinations that are evaluated. Notice that when all parameters are given as lists, scikit-learn samples the combinations without replacement. Setting n_iter larger than the number of possible combinations in the grid (in this case 10 x 2 x 2 = 40) therefore explores nothing new: scikit-learn warns and caps the search at the grid size, which amounts to an exhaustive grid search.
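The size of the grid can be checked directly. The sketch below uses `ParameterGrid` from scikit-learn only to count combinations, and the name `param_grid` is chosen here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

# Candidate values for n_neighbors: log-spaced between 1 and 1000,
# truncated to integers.
n_neighbors_values = param_grid["kneighborsregressor__n_neighbors"].tolist()
print(n_neighbors_values)

# Number of distinct combinations an exhaustive grid search would evaluate.
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 10 x 2 x 2 = 40
```

With n_iter=20, the randomized search therefore evaluates half of the combinations available on this grid.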

Once the computation has completed, print the best combination of parameters stored in the best_params_ attribute.

# solution
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions,
    n_iter=20, n_jobs=2, verbose=1, random_state=1)
model_random_search.fit(data_train, target_train)
model_random_search.best_params_
Fitting 5 folds for each of 20 candidates, totalling 100 fits
{'standardscaler__with_std': True,
 'standardscaler__with_mean': False,
 'kneighborsregressor__n_neighbors': 10}

So the best hyperparameters give a model where the features are scaled but not centered.
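The exercise stops at best_params_. As a possible follow-up, the sketch below compares the cross-validated score of the best candidate with its score on held-out data. So that it runs without downloading anything, it assumes a small synthetic dataset from `make_regression` and a shorter list of n_neighbors values; neither comes from the exercise, and the numbers it prints say nothing about the California housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the California housing data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Smaller n_neighbors values than in the exercise, to fit the toy data size.
param_distributions = {
    "kneighborsregressor__n_neighbors": [1, 3, 10, 30],
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

search = RandomizedSearchCV(
    pipeline, param_distributions=param_distributions, n_iter=8, random_state=0
)
search.fit(X_train, y_train)

# Mean cross-validated R2 of the best candidate on the training set.
cv_score = search.best_score_
# R2 of the best model, refitted on the full training set, on held-out data.
test_score = search.score(X_test, y_test)

print(search.best_params_)
print(f"cross-validation R2: {cv_score:.3f}, test R2: {test_score:.3f}")
```

The same two attributes, best_score_ and score, are available on model_random_search, where score would be called with data_test and target_test.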

Getting the best parameter combination is the main outcome of the hyperparameter optimization procedure. However, it is also interesting to assess the sensitivity of the best models to the choice of those parameters. The following code, not required to answer the quiz question, shows how to conduct such an interactive analysis for this pipeline with a parallel coordinate plot from the plotly library.

We could use cv_results = model_random_search.cv_results_ to make a parallel coordinate plot as we did in the previous notebook (you are more than welcome to try!).

import pandas as pd

cv_results = pd.DataFrame(model_random_search.cv_results_)

To simplify the axes of the plot, we will rename the columns of the dataframe and only select the mean test score and the values of the hyperparameters.

column_name_mapping = {
    "param_kneighborsregressor__n_neighbors": "n_neighbors",
    "param_standardscaler__with_mean": "centering",
    "param_standardscaler__with_std": "scaling",
    "mean_test_score": "mean test score",
}

cv_results = cv_results.rename(columns=column_name_mapping)
cv_results = cv_results[column_name_mapping.values()].sort_values(
    "mean test score", ascending=False)

In addition, the parallel coordinate plot from plotly expects all data to be numeric. Thus, we convert the boolean indicator informing whether or not the data were centered or scaled into an integer, where True is mapped to 1 and False is mapped to 0. As n_neighbors has dtype=object, we also convert it explicitly to an integer.

column_scaler = ["centering", "scaling"]
cv_results[column_scaler] = cv_results[column_scaler].astype(np.int64)
cv_results["n_neighbors"] = cv_results["n_neighbors"].astype(np.int64)
cv_results
    n_neighbors  centering  scaling  mean test score
17           10          0        1         0.687926
18            4          0        1         0.674812
6            46          0        1         0.668778
9           100          0        1         0.648317
16            2          1        1         0.629772
15          215          1        1         0.617295
12          215          0        1         0.617295
10          464          1        1         0.567164
0             1          0        1         0.508809
13         1000          1        1         0.486503
8            21          0        0         0.103390
11           21          1        0         0.103390
3            46          1        0         0.061394
4           100          0        0         0.033122
1           215          0        0         0.017583
5           215          1        0         0.017583
14          464          1        0         0.007987
19          464          0        0         0.007987
7          1000          0        0         0.002900
2             1          0        0        -0.238830

import plotly.express as px

fig = px.parallel_coordinates(
    cv_results,
    color="mean test score",
    dimensions=["n_neighbors", "centering", "scaling", "mean test score"],
    color_continuous_scale=px.colors.diverging.Tealrose,
)
fig.show()
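The plot is interactive, so a non-interactive summary of the same results can be a useful complement. The sketch below copies the values from the cv_results table printed above into a new dataframe and groups them with pandas. The names `results`, `best_by_scaling` and `spread` are chosen here for illustration, and the conclusions only hold for this particular run.

```python
import pandas as pd

# Rows copied from the cv_results table printed above:
# (n_neighbors, centering, scaling, mean test score)
rows = [
    (10, 0, 1, 0.687926), (4, 0, 1, 0.674812), (46, 0, 1, 0.668778),
    (100, 0, 1, 0.648317), (2, 1, 1, 0.629772), (215, 1, 1, 0.617295),
    (215, 0, 1, 0.617295), (464, 1, 1, 0.567164), (1, 0, 1, 0.508809),
    (1000, 1, 1, 0.486503), (21, 0, 0, 0.103390), (21, 1, 0, 0.103390),
    (46, 1, 0, 0.061394), (100, 0, 0, 0.033122), (215, 0, 0, 0.017583),
    (215, 1, 0, 0.017583), (464, 1, 0, 0.007987), (464, 0, 0, 0.007987),
    (1000, 0, 0, 0.002900), (1, 0, 0, -0.238830),
]
results = pd.DataFrame(
    rows, columns=["n_neighbors", "centering", "scaling", "mean test score"]
)

# Best mean test score reached with and without scaling.
best_by_scaling = results.groupby("scaling")["mean test score"].max()
print(best_by_scaling)

# Score spread among candidates that share n_neighbors and scaling,
# so that any difference within a group can only come from centering.
grouped = results.groupby(["n_neighbors", "scaling"])["mean test score"]
spread = grouped.max() - grouped.min()
print(spread.max())
```

In this run, every candidate without scaling scores far below every candidate with scaling, while candidates that differ only in centering get identical scores. The latter is expected for a nearest-neighbors model, since shifting all samples by the same constant does not change the distances between them.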