🏁 Wrap-up quiz 6#
This quiz requires some programming to be answered.
This wrap-up quiz uses the penguins dataset, but notice that we do not use the
traditional Species
column as predictive target:
import pandas as pd
dataset = pd.read_csv("../datasets/penguins.csv")
feature_names = [
"Culmen Length (mm)",
"Culmen Depth (mm)",
"Flipper Length (mm)",
]
target_name = "Body Mass (g)"
dataset = dataset[feature_names + [target_name]].dropna(axis="rows", how="any")
dataset = dataset.sample(frac=1, random_state=0).reset_index(drop=True)
data, target = dataset[feature_names], dataset[target_name]
We therefore define our problem as a regression problem: we want to predict the body mass of a penguin given its culmen and flipper measurements.
Notice that we randomly shuffled the rows of the dataset after loading it
(dataset.sample(frac=1, random_state=0)
). The reason is to break a spurious
order-related statistical dependency that would otherwise cause trouble with the
naive cross-validation procedure we use in this notebook. The problem of
order-dependent samples will be discussed more in detail on the model evaluation
module and is outside of the scope of this quiz for now.
Now, evaluate the following tree-based models:
a decision tree regressor, i.e.
sklearn.tree.DecisionTreeRegressor
a random forest regressor, i.e.
sklearn.ensemble.RandomForestRegressor
Use the default hyper-parameter settings for both models. The only exception
is to pass random_state=0
for all models to be sure to recover the exact
same performance scores as the solutions to this quiz.
Evaluate the generalization performance of these models using a 10-fold cross-validation:
use
sklearn.model_selection.cross_validate
to run the cross-validation routineset the parameter
cv=10
to use a 10-fold cross-validation strategy. Store the training score of the cross-validation by setting the parameterreturn_train_score=True
in the functioncross_validate
as we will use it later on.
Question
By comparing the cross-validation test scores fold-to-fold, count the number of times a random forest is better than a single decision tree. Select the range which this number belongs to:
a) [0, 3]: the random forest model is substantially worse than the single decision tree regressor
b) [4, 6]: both models are almost equivalent
c) [7, 10]: the random forest model is substantially better than the single decision tree regressor
Select a single answer
Now, train and evaluate with the same cross-validation strategy a random forest with 5 decision trees and another containing 100 decision trees. Once again store the training score.
Question
By comparing the cross-validation test scores fold-to-fold, count the number of times a random forest with 100 decision trees is better than a random forest with 5 decision trees. Select the range which this number belongs to:
a) [0, 3]: the random forest model with 100 decision trees is substantially worse than the random forest model with 5 decision trees
b) [4, 6]: both models are almost equivalent
c) [7, 10]: the random forest model with 100 decision trees is substantially better than the random forest model with 5 decision trees
Select a single answer
Plot the validation curve of the n_estimators
parameters defined by:
import numpy as np
n_estimators = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500, 1_000])
Question
Select the correct statements below.
a) the train score decreases when
n_estimators
become large (above 500 trees)b) the train score reaches a plateau when
n_estimators
become large (above 500 trees)c) the train score increases when
n_estimators
become large (above 500 trees)d) the test score decreases when
n_estimators
become large (above 500 trees)e) the test score reaches a plateau when
n_estimators
become large (above 500 trees)f) the test score increases when
n_estimators
become large (above 500 trees)
Select all answers that apply
Repeat the previous experiment but this time, instead of choosing the default
parameters for the random forest, set the parameter max_depth=5
and build
the validation curve.
Question
Comparing the validation curve (train and test scores) of the random forest with a full depth and the random forest with a limited depth, select the correct statements.
a) the test score of the random forest with a full depth is (almost) always better than the test score of the random forest with a limited depth
b) the train score of the random forest with a full depth is (almost) always better than the train score of the random forest with a limited depth
c) the gap between the train and test scores decreases when reducing the depth of the trees of the random forest
d) the gap between the train and test scores increases when reducing the depth of the trees of the random forest
Select all answers that apply
Let us now focus at the very beginning of the validation curves, and
consider the training score of a random forests with a single tree
while using the default max_depth=None
parameter setting:
rf_1_tree = RandomForestRegressor(n_estimators=1, random_state=0)
cv_results_tree = cross_validate(
rf_1_tree, data, target, cv=10, return_train_score=True
)
cv_results_tree["train_score"]
should return:
array([0.83120264, 0.83309064, 0.83195043, 0.84834224, 0.85790323,
0.86235297, 0.84791111, 0.85183089, 0.82241954, 0.85045978])
The fact that this single-tree Random Forest can never reach a perfect R2 score of 1.0 on the training can be surprising.
Indeed, if you we evaluate the training accuracy of the single
DecisionTreeRegressor
one gets perfect memorization of the
training data:
tree = DecisionTreeRegressor(random_state=0)
cv_results_tree = cross_validate(
tree, data, target, cv=10, return_train_score=True
)
cv_results_tree["train_score"]
which outputs the expected perfect score:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Question
From the following statements, select the one that explains that a single-tree random forest cannot achieve perfect training scores.
a) the single tree in the random forest is trained using a bootstrap of the training set and not the training set itself (because
bootstrap=True
by default)b) for a given feature, the single tree in the random forest uses random splits while the single decision tree uses the best split
c) the random forest automatically limits the depth of the single decision tree, which prevents overfitting
Select a single answer
Build a validation curve for a sklearn.ensemble.HistGradientBoostingRegressor
varying max_iter
as follows:
max_iters = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])
We recall that max_iter
corresponds to the number of trees in the boosted
model.
Plot the average train and test score for each value of max_iter
.
Question
Select the correct statements.
a) for a small number of trees (between 5 and 10 trees), the gradient boosting model behave like the random forest algorithm: the train scores are high while the test scores are not optimum
b) for a small number of trees (between 5 and 10 trees), the gradient boosting model behave differently to the random forest algorithm: both the train and test scores are small
c) with a large number of trees (> 100 trees) adding more trees in the ensemble causes the gradient boosting model overfit (increasing the gap between the train score and test score)
d) with a large number of trees (> 100 trees) adding more trees in the ensemble does not impact the generalization performance of the gradient boosting model
Select all answers that apply