Hyperparameter tuning by grid-search

In the previous notebook, we saw that hyperparameters can affect the generalization performance of a model. In this notebook, we will show how to optimize hyperparameters using a grid-search approach.

Our predictive model

Let us reload the dataset as we did previously:

from sklearn import set_config

set_config(display="diagram")
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We extract the column containing the target.

target_name = "class"
target = adult_census[target_name]
target
0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

We drop the target and the "education-num" column from our data, since the latter duplicates the information carried by the "education" column.

data = adult_census.drop(columns=[target_name, "education-num"])
data.head()
age workclass education marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 11th Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 ? Some-college Never-married ? Own-child White Female 0 0 30 United-States

Once the dataset is loaded, we split it into training and testing sets.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

We will define a pipeline as seen in the first module. It will handle both numerical and categorical features.

The first step is to select all the categorical columns.

from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)

Here we will use a tree-based model as a classifier (i.e. HistGradientBoostingClassifier). That means:

  • Numerical variables don’t need scaling;

  • Categorical variables can be handled with an OrdinalEncoder even if the coding order is not meaningful;

  • For tree-based models, the OrdinalEncoder avoids the high-dimensional representations that a one-hot encoding would produce (see the short comparison below).
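To make this last point concrete, here is a small illustrative comparison (not part of the original notebook) of the number of columns produced by an OrdinalEncoder versus a OneHotEncoder on our categorical columns; the exact one-hot width depends on the number of distinct categories in the data.

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# OrdinalEncoder keeps one column per categorical feature.
ordinal_encoded = OrdinalEncoder().fit_transform(data[categorical_columns])
# OneHotEncoder creates one column per category of each feature,
# which yields a much wider (and here sparse) representation.
one_hot_encoded = OneHotEncoder().fit_transform(data[categorical_columns])
print(f"OrdinalEncoder output shape: {ordinal_encoded.shape}")
print(f"OneHotEncoder output shape: {one_hot_encoded.shape}")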

We now build our OrdinalEncoder, configuring it so that categories unseen during fit are encoded with a dedicated value (-1) instead of raising an error.

from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

We then use a ColumnTransformer to apply the OrdinalEncoder to the categorical columns, letting the remaining numerical columns pass through unchanged.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('cat_preprocessor', categorical_preprocessor, categorical_columns)],
    remainder='passthrough', sparse_threshold=0)

Finally, we use a tree-based classifier (i.e. histogram gradient-boosting) to predict whether or not a person earns more than 50 k$ a year.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",
     HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4))])
model
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                                   transformers=[('cat_preprocessor',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'sex',
                                                   'native-country'])])),
                ('classifier',
                 HistGradientBoostingClassifier(max_leaf_nodes=4,
                                                random_state=42))])

Tuning using a grid-search

In the previous exercise we used one for loop for each hyperparameter to find the best combination over a fixed grid of values. GridSearchCV is a scikit-learn class that implements a very similar logic with less repetitive code.

Let’s see how to use the GridSearchCV estimator to perform such a search. Since the grid-search will be costly, we will only explore combinations of the learning rate and the maximum number of leaf nodes.

%%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__learning_rate': (0.01, 0.1, 1, 10),
    'classifier__max_leaf_nodes': (3, 10, 30)}
model_grid_search = GridSearchCV(model, param_grid=param_grid,
                                 n_jobs=2, cv=2)
model_grid_search.fit(data_train, target_train)
CPU times: user 1.51 s, sys: 62.1 ms, total: 1.57 s
Wall time: 8.62 s
GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(remainder='passthrough',
                                                          sparse_threshold=0,
                                                          transformers=[('cat_preprocessor',
                                                                         OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                                        unknown_value=-1),
                                                                         ['workclass',
                                                                          'education',
                                                                          'marital-status',
                                                                          'occupation',
                                                                          'relationship',
                                                                          'race',
                                                                          'sex',
                                                                          'native-country'])])),
                                       ('classifier',
                                        HistGradientBoostingClassifier(max_leaf_nodes=4,
                                                                       random_state=42))]),
             n_jobs=2,
             param_grid={'classifier__learning_rate': (0.01, 0.1, 1, 10),
                         'classifier__max_leaf_nodes': (3, 10, 30)})

Finally, we will check the accuracy of our model using the test set.

accuracy = model_grid_search.score(data_test, target_test)
print(
    f"The test accuracy score of the grid-searched pipeline is: "
    f"{accuracy:.2f}"
)
The test accuracy score of the grid-searched pipeline is: 0.88

Warning

Be aware that the evaluation should normally be performed through cross-validation by providing model_grid_search as a model to the cross_validate function.

Here, we used a single train-test split to evaluate model_grid_search. In a future notebook we will go into more detail about nested cross-validation, where cross-validation is used both for hyperparameter tuning and for model evaluation.
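As a hedged sketch (illustrative, not run in this notebook), the grid-search object would simply be passed as the model to cross_validate:

from sklearn.model_selection import cross_validate

# Outer cross-validation around the grid-search: the inner cv=2 tunes the
# hyperparameters while the outer cv=5 estimates generalization performance.
cv_results_nested = cross_validate(model_grid_search, data, target, cv=5, n_jobs=2)
nested_scores = cv_results_nested["test_score"]
print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")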

The GridSearchCV estimator takes a param_grid parameter which defines all hyperparameters and their associated values. The grid-search is in charge of creating all possible combinations and testing them.

The number of combinations is equal to the product of the number of values to explore for each parameter (e.g. 4 x 3 = 12 combinations in our example). Thus, adding new parameters with their associated values to explore quickly becomes computationally expensive.
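As a quick illustrative check (not part of the original notebook), scikit-learn’s ParameterGrid can enumerate these candidates explicitly:

from sklearn.model_selection import ParameterGrid

# The grid is the Cartesian product of the candidate values: 4 x 3 = 12.
print(f"Number of candidate combinations: {len(ParameterGrid(param_grid))}")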

Once the grid-search is fitted, it can be used as any other predictor by calling predict and predict_proba. Internally, it will use the model with the best parameters found during fit.

Get predictions for the first 5 samples using the estimator with the best parameters.

model_grid_search.predict(data_test.iloc[0:5])
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
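Similarly, and as an illustrative sketch (this cell was not part of the original notebook), predict_proba returns the class probabilities estimated by the refitted best model:

# Probability estimates for the same 5 samples, one column per class.
model_grid_search.predict_proba(data_test.iloc[0:5])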

You can inspect these parameters by looking at the best_params_ attribute.

print(f"The best set of parameters is: "
      f"{model_grid_search.best_params_}")
The best set of parameters is: {'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
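The fitted grid-search also stores the mean cross-validated score of that best candidate; printing it here is an illustrative addition:

# Mean cross-validated accuracy of the best parameter combination.
print(f"Best mean CV score: {model_grid_search.best_score_:.3f}")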

The accuracy and the best parameters of the grid-searched pipeline are similar to the ones we found in the previous exercise, where we searched the best parameters β€œby hand” through a double for loop.
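For reference, here is a minimal sketch of such a "by hand" search (illustrative only; the variable names and the cv=2 evaluation below are our assumptions, not the exact code of the exercise):

from sklearn.model_selection import cross_val_score

# Manually iterate over the same grid with two nested for loops.
best_score, best_params = -1.0, None
for lr in (0.01, 0.1, 1, 10):
    for max_leaf_nodes in (3, 10, 30):
        model.set_params(classifier__learning_rate=lr,
                         classifier__max_leaf_nodes=max_leaf_nodes)
        score = cross_val_score(model, data_train, target_train, cv=2).mean()
        if score > best_score:
            best_score, best_params = score, (lr, max_leaf_nodes)
print(f"Best manual combination: {best_params} with mean CV score {best_score:.3f}")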

In addition, we can inspect all results stored in the cv_results_ attribute of the grid-search. We will filter some specific columns from these results.

cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
    "mean_test_score", ascending=False)
cv_results.head()
mean_fit_time std_fit_time mean_score_time std_score_time param_classifier__learning_rate param_classifier__max_leaf_nodes params split0_test_score split1_test_score mean_test_score std_test_score rank_test_score
5 0.659618 0.051242 0.259914 0.009164 0.1 30 {'classifier__learning_rate': 0.1, 'classifier... 0.868912 0.867213 0.868063 0.000850 1
4 0.471657 0.009687 0.240315 0.005598 0.1 10 {'classifier__learning_rate': 0.1, 'classifier... 0.866783 0.866066 0.866425 0.000359 2
7 0.160865 0.001830 0.093805 0.002656 1 10 {'classifier__learning_rate': 1, 'classifier__... 0.858648 0.862408 0.860528 0.001880 3
6 0.156626 0.007163 0.096688 0.001224 1 3 {'classifier__learning_rate': 1, 'classifier__... 0.859358 0.859514 0.859436 0.000078 4
8 0.164298 0.010579 0.091080 0.002076 1 30 {'classifier__learning_rate': 1, 'classifier__... 0.855536 0.856129 0.855832 0.000296 5

Let us focus on the most interesting columns and shorten the parameter names to remove the "param_classifier__" prefix for readability:

# get the parameter names
column_results = [f"param_{name}" for name in param_grid.keys()]
column_results += [
    "mean_test_score", "std_test_score", "rank_test_score"]
cv_results = cv_results[column_results]
def shorten_param(param_name):
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


cv_results = cv_results.rename(shorten_param, axis=1)
cv_results
learning_rate max_leaf_nodes mean_test_score std_test_score rank_test_score
5 0.1 30 0.868063 0.000850 1
4 0.1 10 0.866425 0.000359 2
7 1 10 0.860528 0.001880 3
6 1 3 0.859436 0.000078 4
8 1 30 0.855832 0.000296 5
3 0.1 3 0.853266 0.000515 6
2 0.01 30 0.843330 0.002917 7
1 0.01 10 0.817832 0.001124 8
0 0.01 3 0.797166 0.000715 9
11 10 30 0.288200 0.050539 10
9 10 3 0.283476 0.003775 11
10 10 10 0.262564 0.006326 12

With only 2 parameters, we might want to visualize the grid-search as a heatmap. We need to transform our cv_results into a dataframe where:

  • the rows will correspond to the learning-rate values;

  • the columns will correspond to the maximum number of leaf nodes;

  • the content of the dataframe will be the mean test scores.

pivoted_cv_results = cv_results.pivot_table(
    values="mean_test_score", index=["learning_rate"],
    columns=["max_leaf_nodes"])

pivoted_cv_results
max_leaf_nodes 3 10 30
learning_rate
0.01 0.797166 0.817832 0.843330
0.10 0.853266 0.866425 0.868063
1.00 0.859436 0.860528 0.855832
10.00 0.283476 0.262564 0.288200

We can use a heatmap representation to show the above dataframe visually.

import seaborn as sns

ax = sns.heatmap(pivoted_cv_results, annot=True, cmap="YlGnBu", vmin=0.7,
                 vmax=0.9)
ax.invert_yaxis()
[Figure: heatmap of mean test scores, with learning_rate on the rows and max_leaf_nodes on the columns]

The table and heatmap above highlight the following:

  • for too high values of learning_rate, the generalization performance of the model is degraded and adjusting the value of max_leaf_nodes cannot fix that problem;

  • outside of this pathological region, we observe that the optimal choice of max_leaf_nodes depends on the value of learning_rate;

  • in particular, we observe a “diagonal” of good models with an accuracy close to the maximum of 0.87: when the value of max_leaf_nodes is increased, one should decrease the value of learning_rate accordingly to preserve a good accuracy.

The precise meaning of those two parameters will be explained later.

For now we will note that, in general, there is no unique optimal parameter setting: 4 models out of the 12 parameter configurations reach the maximal accuracy (up to small random fluctuations caused by the sampling of the training set).

In this notebook we have seen:

  • how to optimize the hyperparameters of a predictive model via a grid-search;

  • that searching for more than two hyperparameters quickly becomes too costly;

  • that a grid-search does not necessarily find an optimal solution.
