📃 Solution for Exercise M3.01

The goal is to write an exhaustive search to find the parameter combination that maximizes the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code faster to execute. Once your code works on the small subset, try changing train_size to a larger value (e.g. 0.8 for 80% instead of 20%).

import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_preprocessor, selector(dtype_include=object)),
    remainder="passthrough",
)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42)),
    ]
)

Using the previously defined model (called model) and two nested for loops, search for the best combination of the learning_rate and max_leaf_nodes parameters. For each combination, set the parameters on the model, then train and evaluate it. The evaluation should be performed using cross_val_score on the training set. Use the following parameter grid:

learning_rate for the values 0.01, 0.1, 1 and 10. This parameter controls the ability of a new tree to correct the error of the previous sequence of trees.
max_leaf_nodes for the values 3, 10, 30. This parameter sets the maximum number of leaves of each tree, and thereby bounds its depth.

# solution
from sklearn.model_selection import cross_val_score

learning_rate = [0.01, 0.1, 1, 10]
max_leaf_nodes = [3, 10, 30]

best_score = 0
best_params = {}
for lr in learning_rate:
    for mln in max_leaf_nodes:
        print(
            (
                f"Evaluating model with learning rate {lr:.3f}"
                f" and max leaf nodes {mln}... "
            ),
            end="",
        )
        model.set_params(
            classifier__learning_rate=lr, classifier__max_leaf_nodes=mln
        )
        scores = cross_val_score(model, data_train, target_train, cv=2)
        mean_score = scores.mean()
        print(f"score: {mean_score:.3f}")
        if mean_score > best_score:
            best_score = mean_score
            best_params = {"learning_rate": lr, "max_leaf_nodes": mln}
            print(f"Found new best model with score {best_score:.3f}!")

print(f"The best accuracy obtained is {best_score:.3f}")
print(f"The best parameters found are:\n {best_params}")

Evaluating model with learning rate 0.010 and max leaf nodes 3... score: 0.789
Found new best model with score 0.789!
Evaluating model with learning rate 0.010 and max leaf nodes 10... score: 0.813
Found new best model with score 0.813!
Evaluating model with learning rate 0.010 and max leaf nodes 30... score: 0.842
Found new best model with score 0.842!
Evaluating model with learning rate 0.100 and max leaf nodes 3... score: 0.847
Found new best model with score 0.847!
Evaluating model with learning rate 0.100 and max leaf nodes 10... score: 0.859
Found new best model with score 0.859!
Evaluating model with learning rate 0.100 and max leaf nodes 30... score: 0.857
Evaluating model with learning rate 1.000 and max leaf nodes 3... score: 0.852
Evaluating model with learning rate 1.000 and max leaf nodes 10... score: 0.833
Evaluating model with learning rate 1.000 and max leaf nodes 30... score: 0.828
Evaluating model with learning rate 10.000 and max leaf nodes 3... score: 0.288
Evaluating model with learning rate 10.000 and max leaf nodes 10... score: 0.480
Evaluating model with learning rate 10.000 and max leaf nodes 30... score: 0.639
The best accuracy obtained is 0.859
The best parameters found are:
 {'learning_rate': 0.1, 'max_leaf_nodes': 10}

Now use the test set to score the model using the best parameters that we found using cross-validation. You will have to refit the model over the full training set.

# solution
best_lr = best_params["learning_rate"]
best_mln = best_params["max_leaf_nodes"]

model.set_params(
    classifier__learning_rate=best_lr, classifier__max_leaf_nodes=best_mln
)
model.fit(data_train, target_train)
test_score = model.score(data_test, target_test)

print(f"Test score after the parameter tuning: {test_score:.3f}")

Test score after the parameter tuning: 0.870