📝 Exercise M3.02
The goal is to find the best set of hyperparameters that maximize the
generalization performance, as estimated by cross-validation on the training set.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the California housing dataset as pandas objects
data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
In this exercise, we progressively define the regression pipeline and later tune its hyperparameters.
Start by defining a pipeline that:

- uses a StandardScaler to normalize the numerical data;
- uses a sklearn.neighbors.KNeighborsRegressor as a predictive model.
# Write your code here.
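A minimal sketch of such a pipeline, using make_pipeline so that the step
names (standardscaler, kneighborsregressor) are generated automatically;
these names are reused when addressing the hyperparameters below:

# Chain a scaler and a k-NN regressor into a single estimator.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

model = make_pipeline(StandardScaler(), KNeighborsRegressor())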
Use RandomizedSearchCV with n_iter=20 and
scoring="neg_mean_absolute_error" to tune the following hyperparameters
of the model:
- the parameter n_neighbors of the KNeighborsRegressor with values
  np.logspace(0, 3, num=10).astype(np.int32);
- the parameter with_mean of the StandardScaler with possible values
  True or False;
- the parameter with_std of the StandardScaler with possible values
  True or False.
The scoring function is expected to return higher values for better models,
since grid and randomized search objects maximize it. Because of that, error
metrics like mean_absolute_error must be negated (using the neg_ prefix)
to work correctly: lower errors then correspond to higher (less negative)
scores.
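To see this sign convention in action, we can cross-validate the pipeline
sketched above with this scorer. This is only an illustration and assumes
the model variable defined earlier:

from sklearn.model_selection import cross_val_score

# Each score is a negated MAE (in -k$): higher, i.e. closer to 0, is better.
scores = cross_val_score(
    model, data_train, target_train, scoring="neg_mean_absolute_error"
)
print(scores)  # all values are negative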
Notice that in the notebook "Hyperparameter tuning by randomized-search" we
pass distributions to be sampled by the RandomizedSearchCV. In this case we
define a fixed grid of hyperparameters to be explored. Using a GridSearchCV
instead would explore all the possible combinations on the grid, which can be
costly to compute for large grids, whereas the parameter n_iter of the
RandomizedSearchCV controls the number of different random combinations that
are evaluated. Notice that setting n_iter larger than the number of possible
combinations in a grid (in this case 10 x 2 x 2 = 40) would lead to repeating
already-explored combinations.
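As a quick sanity check on that count, the n_neighbors grid indeed contains
10 distinct values once truncated to integers:

import numpy as np

n_neighbors_values = np.logspace(0, 3, num=10).astype(np.int32)
print(n_neighbors_values)               # [1 2 4 10 21 46 100 215 464 1000]
print(len(n_neighbors_values) * 2 * 2)  # 40 possible combinations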
Once the computation has completed, print the best combination of parameters
stored in the best_params_ attribute.
# Write your code here.
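One possible solution sketch. It assumes the make_pipeline model defined
above, so the hyperparameters are addressed through the auto-generated step
names (standardscaler, kneighborsregressor); the random_state is an extra
assumption added here for reproducibility:

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    scoring="neg_mean_absolute_error",
    random_state=42,  # assumption: fixed seed, not required by the exercise
)
model_random_search.fit(data_train, target_train)
print(model_random_search.best_params_)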