Hyperparameter tuning by grid-search#
In the previous notebook, we saw that hyperparameters can affect the generalization performance of a model. In this notebook, we show how to optimize hyperparameters using a grid-search approach.
Our predictive model#
Let us reload the dataset as we did previously:
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
We extract the column containing the target.
target_name = "class"
target = adult_census[target_name]
target
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
...
48837 <=50K
48838 >50K
48839 <=50K
48840 <=50K
48841 >50K
Name: class, Length: 48842, dtype: object
We drop from our data the target and the "education-num"
column which
duplicates the information from the "education"
column.
data = adult_census.drop(columns=[target_name, "education-num"])
data
age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
1 | 38 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
2 | 28 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
3 | 44 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
4 | 18 | ? | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 27 | Private | Assoc-acdm | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States |
48838 | 40 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
48839 | 58 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States |
48840 | 22 | Private | HS-grad | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States |
48841 | 52 | Self-emp-inc | HS-grad | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States |
48842 rows Γ 12 columns
Once the dataset is loaded, we split it into a training and testing sets.
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
data, target, random_state=42
)
We define a pipeline as seen in the first module, to handle both numerical and categorical features.
The first step is to select all the categorical columns.
from sklearn.compose import make_column_selector as selector
categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
Here we use a tree-based model as a classifier (i.e.
HistGradientBoostingClassifier
). That means:
Numerical variables donβt need scaling;
Categorical variables can be dealt with an
OrdinalEncoder
even if the coding order is not meaningful;For tree-based models, the
OrdinalEncoder
avoids having high-dimensional representations.
We now build our OrdinalEncoder
by passing it the known categories.
from sklearn.preprocessing import OrdinalEncoder
categorical_preprocessor = OrdinalEncoder(
handle_unknown="use_encoded_value", unknown_value=-1
)
We then use a ColumnTransformer
to select the categorical columns and apply
the OrdinalEncoder
to them.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
[("cat_preprocessor", categorical_preprocessor, categorical_columns)],
remainder="passthrough",
)
Finally, we use a tree-based classifier (i.e. histogram gradient-boosting) to predict whether or not a person earns more than 50 k$ a year.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline
model = Pipeline(
[
("preprocessor", preprocessor),
(
"classifier",
HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4),
),
]
)
model
Pipeline(steps=[('preprocessor', ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])), ('classifier', HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocessor', ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])), ('classifier', HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42))])
ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
passthrough
HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42)
Tuning using a grid-search#
In the previous exercise we used one for
loop for each hyperparameter to
find the best combination over a fixed grid of values. GridSearchCV
is a
scikit-learn class that implements a very similar logic with less repetitive
code.
Letβs see how to use the GridSearchCV
estimator for doing such search. Since
the grid-search is costly, we only explore the combination learning-rate and
the maximum number of nodes.
%%time
from sklearn.model_selection import GridSearchCV
param_grid = {
"classifier__learning_rate": (0.01, 0.1, 1, 10),
"classifier__max_leaf_nodes": (3, 10, 30),
}
model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)
model_grid_search.fit(data_train, target_train)
CPU times: user 1.02 s, sys: 65.8 ms, total: 1.09 s
Wall time: 5.5 s
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py:1623: FutureWarning:
The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).
To use the new behavior now and suppress this warning, use ColumnTransformer(force_int_remainder_cols=False).
warnings.warn(
GridSearchCV(cv=2, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])), ('classifier', HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42))]), n_jobs=2, param_grid={'classifier__learning_rate': (0.01, 0.1, 1, 10), 'classifier__max_leaf_nodes': (3, 10, 30)})In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=2, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])), ('classifier', HistGradientBoostingClassifier(max_leaf_nodes=4, random_state=42))]), n_jobs=2, param_grid={'classifier__learning_rate': (0.01, 0.1, 1, 10), 'classifier__max_leaf_nodes': (3, 10, 30)})
Pipeline(steps=[('preprocessor', ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])), ('classifier', HistGradientBoostingClassifier(max_leaf_nodes=30, random_state=42))])
ColumnTransformer(remainder='passthrough', transformers=[('cat_preprocessor', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'])])
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
['age', 'capital-gain', 'capital-loss', 'hours-per-week']
passthrough
HistGradientBoostingClassifier(max_leaf_nodes=30, random_state=42)
Finally, we check the accuracy of our model using the test set.
accuracy = model_grid_search.score(data_test, target_test)
print(
f"The test accuracy score of the grid-searched pipeline is: {accuracy:.2f}"
)
The test accuracy score of the grid-searched pipeline is: 0.88
Warning
Be aware that the evaluation should normally be performed through
cross-validation by providing model_grid_search
as a model to the
cross_validate
function.
Here, we used a single train-test split to to evaluate model_grid_search
. In
a future notebook will go into more detail about nested cross-validation, when
you use cross-validation both for hyperparameter tuning and model evaluation.
The GridSearchCV
estimator takes a param_grid
parameter which defines all
hyperparameters and their associated values. The grid-search is in charge
of creating all possible combinations and test them.
The number of combinations are equal to the product of the number of values to explore for each parameter (e.g. in our example 4 x 3 combinations). Thus, adding new parameters with their associated values to be explored become rapidly computationally expensive.
Once the grid-search is fitted, it can be used as any other predictor by
calling predict
and predict_proba
. Internally, it uses the model with the
best parameters found during fit
.
Get predictions for the 5 first samples using the estimator with the best parameters.
model_grid_search.predict(data_test.iloc[0:5])
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
You can know about these parameters by looking at the best_params_
attribute.
print(f"The best set of parameters is: {model_grid_search.best_params_}")
The best set of parameters is: {'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
The accuracy and the best parameters of the grid-searched pipeline are similar to the ones we found in the previous exercise, where we searched the best parameters βby handβ through a double for loop.
In addition, we can inspect all results which are stored in the attribute
cv_results_
of the grid-search. We filter some specific columns from these
results.
cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
"mean_test_score", ascending=False
)
cv_results
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__learning_rate | param_classifier__max_leaf_nodes | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 0.358894 | 0.034587 | 0.189584 | 0.009570 | 0.10 | 30 | {'classifier__learning_rate': 0.1, 'classifier... | 0.868912 | 0.867213 | 0.868063 | 0.000850 | 1 |
4 | 0.279563 | 0.004914 | 0.166795 | 0.001955 | 0.10 | 10 | {'classifier__learning_rate': 0.1, 'classifier... | 0.866783 | 0.866066 | 0.866425 | 0.000359 | 2 |
7 | 0.091758 | 0.000320 | 0.068008 | 0.005037 | 1.00 | 10 | {'classifier__learning_rate': 1, 'classifier__... | 0.854826 | 0.862899 | 0.858863 | 0.004036 | 3 |
6 | 0.117121 | 0.024804 | 0.074849 | 0.004973 | 1.00 | 3 | {'classifier__learning_rate': 1, 'classifier__... | 0.853844 | 0.860934 | 0.857389 | 0.003545 | 4 |
3 | 0.202173 | 0.002418 | 0.111300 | 0.001227 | 0.10 | 3 | {'classifier__learning_rate': 0.1, 'classifier... | 0.852752 | 0.853781 | 0.853266 | 0.000515 | 5 |
8 | 0.103435 | 0.006266 | 0.071747 | 0.001440 | 1.00 | 30 | {'classifier__learning_rate': 1, 'classifier__... | 0.853734 | 0.848321 | 0.851028 | 0.002707 | 6 |
2 | 0.448060 | 0.000558 | 0.216894 | 0.006915 | 0.01 | 30 | {'classifier__learning_rate': 0.01, 'classifie... | 0.840413 | 0.846246 | 0.843330 | 0.002917 | 7 |
1 | 0.288674 | 0.002739 | 0.168220 | 0.002892 | 0.01 | 10 | {'classifier__learning_rate': 0.01, 'classifie... | 0.818956 | 0.816708 | 0.817832 | 0.001124 | 8 |
0 | 0.205448 | 0.002558 | 0.113309 | 0.000597 | 0.01 | 3 | {'classifier__learning_rate': 0.01, 'classifie... | 0.797882 | 0.796451 | 0.797166 | 0.000715 | 9 |
10 | 0.064057 | 0.004787 | 0.052656 | 0.002797 | 10.00 | 10 | {'classifier__learning_rate': 10, 'classifier_... | 0.742356 | 0.493803 | 0.618080 | 0.124277 | 10 |
11 | 0.062735 | 0.000718 | 0.047181 | 0.001019 | 10.00 | 30 | {'classifier__learning_rate': 10, 'classifier_... | 0.759937 | 0.338739 | 0.549338 | 0.210599 | 11 |
9 | 0.057764 | 0.000119 | 0.046180 | 0.000701 | 10.00 | 3 | {'classifier__learning_rate': 10, 'classifier_... | 0.279701 | 0.287251 | 0.283476 | 0.003775 | 12 |
Let us focus on the most interesting columns and shorten the parameter names
to remove the "param_classifier__"
prefix for readability:
# get the parameter names
column_results = [f"param_{name}" for name in param_grid.keys()]
column_results += ["mean_test_score", "std_test_score", "rank_test_score"]
cv_results = cv_results[column_results]
def shorten_param(param_name):
if "__" in param_name:
return param_name.rsplit("__", 1)[1]
return param_name
cv_results = cv_results.rename(shorten_param, axis=1)
cv_results
learning_rate | max_leaf_nodes | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|
5 | 0.10 | 30 | 0.868063 | 0.000850 | 1 |
4 | 0.10 | 10 | 0.866425 | 0.000359 | 2 |
7 | 1.00 | 10 | 0.858863 | 0.004036 | 3 |
6 | 1.00 | 3 | 0.857389 | 0.003545 | 4 |
3 | 0.10 | 3 | 0.853266 | 0.000515 | 5 |
8 | 1.00 | 30 | 0.851028 | 0.002707 | 6 |
2 | 0.01 | 30 | 0.843330 | 0.002917 | 7 |
1 | 0.01 | 10 | 0.817832 | 0.001124 | 8 |
0 | 0.01 | 3 | 0.797166 | 0.000715 | 9 |
10 | 10.00 | 10 | 0.618080 | 0.124277 | 10 |
11 | 10.00 | 30 | 0.549338 | 0.210599 | 11 |
9 | 10.00 | 3 | 0.283476 | 0.003775 | 12 |
With only 2 parameters, we might want to visualize the grid-search as a
heatmap. We need to transform our cv_results
into a dataframe where:
the rows correspond to the learning-rate values;
the columns correspond to the maximum number of leaf;
the content of the dataframe is the mean test scores.
pivoted_cv_results = cv_results.pivot_table(
values="mean_test_score",
index=["learning_rate"],
columns=["max_leaf_nodes"],
)
pivoted_cv_results
max_leaf_nodes | 3 | 10 | 30 |
---|---|---|---|
learning_rate | |||
0.01 | 0.797166 | 0.817832 | 0.843330 |
0.10 | 0.853266 | 0.866425 | 0.868063 |
1.00 | 0.857389 | 0.858863 | 0.851028 |
10.00 | 0.283476 | 0.618080 | 0.549338 |
We can use a heatmap representation to show the above dataframe visually.
import seaborn as sns
ax = sns.heatmap(
pivoted_cv_results, annot=True, cmap="YlGnBu", vmin=0.7, vmax=0.9
)
ax.invert_yaxis()
The above tables highlights the following things:
for too high values of
learning_rate
, the generalization performance of the model is degraded and adjusting the value ofmax_leaf_nodes
cannot fix that problem;outside of this pathological region, we observe that the optimal choice of
max_leaf_nodes
depends on the value oflearning_rate
;in particular, we observe a βdiagonalβ of good models with an accuracy close to the maximal of 0.87: when the value of
max_leaf_nodes
is increased, one should decrease the value oflearning_rate
accordingly to preserve a good accuracy.
The precise meaning of those two parameters will be explained later.
For now we note that, in general, there is no unique optimal parameter setting: 4 models out of the 12 parameter configurations reach the maximal accuracy (up to small random fluctuations caused by the sampling of the training set).
In this notebook we have seen:
how to optimize the hyperparameters of a predictive model via a grid-search;
that searching for more than two hyperparamters is too costly;
that a grid-search does not necessarily find an optimal solution.