Hyperparameter tuning#

In the previous section, we did not discuss the parameters of random forest and gradient-boosting. However, there are a couple of things to keep in mind when setting these.

This notebook gives crucial information regarding how to set the hyperparameters of both random forest and gradient boosting decision tree models.

Caution

For the sake of clarity, no nested cross-validation is used to estimate the variability of the testing error. We only show the effect of the parameters on the validation set of what should be the inner loop of a nested cross-validation.
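
As a point of reference, the full nested procedure would wrap the hyperparameter search in an outer cross-validation. A minimal sketch, assuming the data, target, and search_cv objects defined in the cells below:

from sklearn.model_selection import cross_validate

# Outer loop of a nested cross-validation: evaluate the whole tuning
# procedure (the inner loop run by `search_cv`) on several held-out folds.
outer_results = cross_validate(
    search_cv, data, target, cv=5, scoring="neg_mean_absolute_error"
)
print(f"Nested CV error: {-outer_results['test_score'].mean():.2f} k$")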

Random forest#

The main parameter to tune for a random forest is n_estimators. In general, the more trees in the forest, the better the generalization performance. However, adding trees slows down the fitting and prediction time. The goal is to balance computing time and generalization performance when setting the number of estimators, in particular when putting such a learner into production.
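
To make this trade-off concrete, here is a standalone sketch on synthetic data (the dataset and values are illustrative, not part of the analysis below): fitting time grows roughly linearly with the number of trees.

import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Fitting time grows with the number of trees, usually for diminishing
# returns in generalization performance.
X, y = make_regression(n_samples=5_000, n_features=10, random_state=0)
for n_estimators in (10, 100, 1_000):
    start = time.perf_counter()
    RandomForestRegressor(n_estimators=n_estimators, n_jobs=2).fit(X, y)
    print(f"{n_estimators:>5} trees fitted in {time.perf_counter() - start:.2f} s")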

Then, we could also tune a parameter that controls the depth of each tree in the forest. Two parameters are important for this: max_depth and max_leaf_nodes. They differ in the way they control the tree structure. Indeed, max_depth enforces a more symmetric tree, while max_leaf_nodes does not impose such a constraint.
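
The following sketch illustrates the difference on synthetic data (illustrative values): with max_depth every branch stops at the same level, while max_leaf_nodes grows the tree best-first and can produce branches of very different depths.

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=5_000, n_features=10, random_state=0)

# A depth limit bounds every branch at the same level (symmetric tree);
# a leaf limit grows the most promising splits first, so some branches
# can end up much deeper than others.
depth_limited = DecisionTreeRegressor(max_depth=5).fit(X, y)
leaves_limited = DecisionTreeRegressor(max_leaf_nodes=32).fit(X, y)

print(f"max_depth=5:       depth={depth_limited.get_depth()}, "
      f"leaves={depth_limited.get_n_leaves()}")
print(f"max_leaf_nodes=32: depth={leaves_limited.get_depth()}, "
      f"leaves={leaves_limited.get_n_leaves()}")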

Be aware that with random forest, trees are generally deep since we seek to overfit each tree on its bootstrap sample; this overfitting is mitigated by combining the trees altogether. Assembling underfitted trees (i.e. shallow trees) might instead lead to an underfitted forest; a quick check follows the data loading below.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)
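
As a quick check of the previous claim (illustrative models, not tuned): a single fully grown tree overfits, a forest of such trees generalizes better, and a forest of very shallow trees underfits.

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "single deep tree": DecisionTreeRegressor(random_state=0),
    "forest of deep trees": RandomForestRegressor(n_jobs=2, random_state=0),
    "forest of shallow trees": RandomForestRegressor(
        max_depth=2, n_jobs=2, random_state=0),
}
for name, model in models.items():
    model.fit(data_train, target_train)
    print(f"{name}: R2 on test = {model.score(data_test, target_test):.3f}")
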
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}
search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,
    scoring="neg_mean_absolute_error", n_iter=10, random_state=0, n_jobs=2,
)
search_cv.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_distributions.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")
   param_n_estimators  param_max_leaf_nodes  mean_test_error  std_test_error
0                 500                   100        40.838385        0.720018
2                  10                   100        41.589679        0.356633
7                 100                    50        43.960034        0.905163
8                   1                   100        47.017331        1.194361
6                  50                    20        49.449529        0.706529
1                 100                    20        49.462949        0.856184
9                  10                    20        49.817150        0.605263
3                 500                    10        54.595793        0.828441
4                   5                     5        61.449142        1.010583
5                   5                     2        73.621985        1.241656

We observe in our search that a large number of leaves, and thus deep trees, is required. This parameter seems particularly impactful compared to the number of trees for this particular dataset: with at least 50 trees, the generalization performance is driven by the number of leaves.

Now we estimate the generalization performance of the best model by refitting it with the full training set and using the test set for scoring on unseen data. The refit on the full training set is done by default by RandomizedSearchCV when calling the .fit method.
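
The refitted configuration can be inspected through the best_params_ attribute of the fitted search:

print(search_cv.best_params_)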

error = -search_cv.score(data_test, target_test)
print(f"On average, our random forest regressor makes an error of {error:.2f} k$")
On average, our random forest regressor makes an error of 41.98 k$

Gradient-boosting decision trees#

For gradient-boosting, the parameters are coupled, so we cannot set them one after the other anymore. The important parameters are n_estimators, learning_rate, and max_depth or max_leaf_nodes (as previously discussed for random forest).

Let’s first discuss the max_depth (or max_leaf_nodes) parameter. We saw in the section on gradient-boosting that the algorithm fits the error of the previous tree in the ensemble. Thus, fitting fully grown trees would be detrimental. Indeed, the first tree of the ensemble would perfectly fit (overfit) the data and no subsequent tree would be required, since there would be no residuals left. Therefore, the trees used in gradient-boosting should have a low depth, typically between 3 and 8 levels, or few leaves (\(2^3=8\) to \(2^8=256\)). Having very weak learners at each step helps reduce overfitting.

With this consideration in mind, the deeper the trees, the faster the residuals are corrected, and the fewer learners are required. Therefore, n_estimators should be increased if max_depth is lower.
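
As an illustration of this trade-off (illustrative, untuned values, reusing the training data from above):

from sklearn.ensemble import GradientBoostingRegressor

# Deeper trees correct the residuals in fewer boosting iterations;
# shallower trees need more of them to reach a comparable error.
deep_few = GradientBoostingRegressor(max_depth=8, n_estimators=50)
shallow_many = GradientBoostingRegressor(max_depth=3, n_estimators=500)
for name, model in [("deep trees, few", deep_few),
                    ("shallow trees, many", shallow_many)]:
    model.fit(data_train, target_train)
    print(f"{name}: R2 on test = {model.score(data_test, target_test):.3f}")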

Finally, we have overlooked the impact of the learning_rate parameter until now. When fitting the residuals, we would like each tree to correct either all of the remaining error or only a fraction of it. The learning_rate scales the contribution of each tree and thus controls this behaviour: with a small value, each tree corrects only a small fraction of the residuals, while with a large value (e.g., 1), each tree fits the residuals fully. So, with a very low learning_rate, we need more estimators to correct the overall error. However, a too large learning_rate tends to produce an overfitted ensemble, similar to having a too large tree depth.
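
A quick sweep over the learning rate, keeping the other parameters fixed (illustrative values), shows this behaviour:

from sklearn.ensemble import GradientBoostingRegressor

for learning_rate in (0.01, 0.1, 1.0):
    model = GradientBoostingRegressor(
        n_estimators=100, learning_rate=learning_rate, random_state=0
    ).fit(data_train, target_train)
    print(f"learning_rate={learning_rate}: "
          f"R2 on test = {model.score(data_test, target_test):.3f}")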

from scipy.stats import loguniform
from sklearn.ensemble import GradientBoostingRegressor

param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
    "learning_rate": loguniform(0.01, 1),
}
search_cv = RandomizedSearchCV(
    GradientBoostingRegressor(), param_distributions=param_distributions,
    scoring="neg_mean_absolute_error", n_iter=20, random_state=0, n_jobs=2
)
search_cv.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_distributions.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")
    param_n_estimators  param_max_leaf_nodes  param_learning_rate  mean_test_error  std_test_error
1                  200                    20             0.160519        33.896344        0.431046
12                 200                    50             0.110585        34.790288        0.287173
17                 500                     5             0.771785        34.830802        0.569708
10                 200                    20             0.109889        35.004120        0.378880
6                  500                   100             0.709894        35.460436        0.325446
18                  10                     5             0.637819        42.535730        0.338066
3                  500                     2             0.075020        43.457866        0.704599
4                  100                     5             0.035100        46.558900        0.578629
19                   5                    20             0.202432        61.387176        0.610988
8                    5                     2             0.462636        65.114017        0.846987
9                   10                     5             0.088556        66.243538        0.720131
15                  50                   100             0.010904        71.847050        0.683326
5                    2                     2             0.421054        74.384704        0.791104
2                    5                   100             0.070357        77.007841        0.789595
16                   2                    50             0.167568        77.131005        0.850380
11                   1                     5             0.190477        82.819015        0.976351
13                   5                    20             0.033815        83.765509        0.974672
0                    1                   100             0.125207        85.363288        1.040982
14                   1                    10             0.081715        87.373374        1.071555
7                    1                    20             0.014937        90.531295        1.113892

Caution

Here, we tune n_estimators, but be aware that using early stopping, as in the previous exercise, is a better approach.
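
As a sketch of that alternative: GradientBoostingRegressor supports early stopping through its n_iter_no_change and validation_fraction parameters, so we can set a generous n_estimators and let the model stop by itself (illustrative values):

from sklearn.ensemble import GradientBoostingRegressor

# Stop adding trees once the score on an internal validation set has not
# improved for `n_iter_no_change` consecutive iterations.
model = GradientBoostingRegressor(
    n_estimators=1_000,        # upper bound on the number of trees
    n_iter_no_change=5,        # patience before stopping
    validation_fraction=0.1,   # share of training data held out internally
    random_state=0,
)
model.fit(data_train, target_train)
print(f"Early stopping kept {model.n_estimators_} trees")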

In this search, we see that the learning_rate needs to be large enough, i.e. > 0.1. We also observe that for the best-ranked models, a smaller learning_rate requires more trees or a larger number of leaves per tree. However, it is particularly difficult to draw more detailed conclusions since the best value of one hyperparameter depends on the values of the others.

Now we estimate the generalization performance of the best model using the test set.

error = -search_cv.score(data_test, target_test)
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")
On average, our GBDT regressor makes an error of 33.01 k$

The mean test error on the held-out test set is slightly lower than the cross-validated error of the best model. The reason is that the final model is refitted on the whole training set and is therefore trained on more data than the inner cross-validated models of the randomized search procedure.
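
This can be verified by comparing the best inner cross-validated error with the test error printed above:

print(f"Best inner CV error: {-search_cv.best_score_:.2f} k$")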