🏁 Wrap-up quiz

Answering this quiz requires some programming.

Open the dataset blood_transfusion.csv.

import pandas as pd

blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]

In this dataset, the column "Class" is the target vector containing the labels that our model should predict.

For all the questions below, make a cross-validation evaluation using a 10-fold cross-validation strategy.

Evaluate the performance of a sklearn.dummy.DummyClassifier that always predicts the most frequent class seen during training. Be aware that you can pass a list of scores to compute in sklearn.model_selection.cross_validate by setting the scoring parameter.
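
This step could be sketched as follows. Since the real blood_transfusion.csv file is not bundled here, the sketch uses a synthetic imbalanced dataset (via make_classification, with class proportions roughly matching the quiz answers) as a stand-in — the numbers it prints are illustrative, not the quiz answers:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the blood transfusion dataset: ~76%/24% class imbalance
data, target = make_classification(
    n_samples=748, weights=[0.76, 0.24], random_state=0
)

# A baseline that always predicts the most frequent class of the training set
dummy = DummyClassifier(strategy="most_frequent")

# A list of scorer names can be passed via the `scoring` parameter
cv_results = cross_validate(
    dummy, data, target, cv=10,
    scoring=["accuracy", "balanced_accuracy"],
)
print(f"Accuracy: {cv_results['test_accuracy'].mean():.3f}")
print(f"Balanced accuracy: {cv_results['test_balanced_accuracy'].mean():.3f}")
```

Note that for such a majority-class baseline, the balanced accuracy is 0.5 by construction (perfect recall on the majority class, zero on the minority class), whatever the class proportions.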

Question

What is the accuracy of this dummy classifier?

  • a) ~0.5

  • b) ~0.62

  • c) ~0.75

Select a single answer

Question

What is the balanced accuracy of this dummy classifier?

  • a) ~0.5

  • b) ~0.62

  • c) ~0.75

Select a single answer

Replace the DummyClassifier with a sklearn.tree.DecisionTreeClassifier and check the statistical performance to answer the question below.
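
The same cross-validation, swapping in a decision tree, could look like this (again on the synthetic stand-in dataset, so the printed score is only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the blood transfusion dataset
data, target = make_classification(
    n_samples=748, weights=[0.76, 0.24], random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(tree, data, target, cv=10, scoring="balanced_accuracy")
print(f"Balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```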

Question

Is a single decision tree classifier better than the dummy classifier, with an increase of at least 0.04 in balanced accuracy?

  • a) Yes

  • b) No

Select a single answer

Evaluate the performance of a sklearn.ensemble.RandomForestClassifier using 300 trees.
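
A sketch of this evaluation, still on the synthetic stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the blood transfusion dataset
data, target = make_classification(
    n_samples=748, weights=[0.76, 0.24], random_state=0
)

# 300 trees, as requested in the quiz
forest = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(forest, data, target, cv=10, scoring="balanced_accuracy")
print(f"Balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```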

Question

Is the random forest better than the dummy classifier, with an increase of at least 0.04 in balanced accuracy?

  • a) Yes

  • b) No

Select a single answer

Compare a sklearn.ensemble.GradientBoostingClassifier and a sklearn.ensemble.RandomForestClassifier, both with 300 trees. To do so, repeat 10 times a 10-fold cross-validation using the balanced accuracy as the metric. For each of the ten repetitions, compute the average cross-validation score for both models, then count how many times one model is better than the other.
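
The repetition can be driven by reshuffling the cross-validation splits with a different seed each time. The sketch below uses the synthetic stand-in dataset and scales the quiz down (50 trees and 3 repetitions instead of 300 and 10) just to keep it fast; restore the quiz values for the actual answer:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the blood transfusion dataset
data, target = make_classification(
    n_samples=748, weights=[0.76, 0.24], random_state=0
)

n_repeats = 3  # the quiz asks for 10 repetitions of the 10-fold CV
wins = 0
for seed in range(n_repeats):
    # a new shuffling of the folds for each repetition
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    gb_score = cross_val_score(
        GradientBoostingClassifier(n_estimators=50, random_state=0),
        data, target, cv=cv, scoring="balanced_accuracy",
    ).mean()
    rf_score = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=0),
        data, target, cv=cv, scoring="balanced_accuracy",
    ).mean()
    wins += gb_score > rf_score

print(f"Gradient boosting wins {wins} out of {n_repeats} repetitions")
```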

Question

On average, is the gradient boosting better than the random forest?

  • a) Yes

  • b) No

  • c) Equivalent

Select a single answer

Evaluate the performance of a sklearn.ensemble.HistGradientBoostingClassifier. Enable early stopping and add as many trees as needed.

Note: in scikit-learn versions prior to 1.0, you need a specific import to make HistGradientBoostingClassifier available:

# explicitly require this experimental feature
from sklearn.experimental import enable_hist_gradient_boosting
# now you can import normally from ensemble
from sklearn.ensemble import HistGradientBoostingClassifier

Question

Considering the mean of the cross-validation test scores, is histogram gradient boosting a better classifier?

  • a) Histogram gradient boosting is the best estimator

  • b) Histogram gradient boosting is better than the random forest but worse than the exact gradient boosting

  • c) Histogram gradient boosting is better than the exact gradient boosting but worse than the random forest

  • d) Histogram gradient boosting is the worst estimator

Select a single answer

Question

With early stopping activated, how many trees on average did the HistGradientBoostingClassifier need to converge?

  • a) ~30

  • b) ~100

  • c) ~150

  • d) ~200

  • e) ~300

Select a single answer

Imbalanced-learn is an open-source library relying on scikit-learn that provides methods to deal with classification problems with imbalanced classes.

Here, we will be using the class imblearn.ensemble.BalancedBaggingClassifier to alleviate the issue of class imbalance.

Use the BalancedBaggingClassifier and pass a HistGradientBoostingClassifier as its base_estimator. Fix the hyperparameter n_estimators to 50.

Question

What is a BalancedBaggingClassifier?

  • a) A classifier that makes sure that all leaves of each tree are at the same depth level

  • b) A classifier that explicitly maximizes the balanced accuracy score

  • c) Equivalent to a sklearn.ensemble.BaggingClassifier with each bootstrap sample resampled to contain as many samples from each class

Select a single answer

Question

Compared to the balanced accuracy of the HistGradientBoostingClassifier alone (computed in one of the previous questions), the balanced accuracy of the BalancedBaggingClassifier is:

  • a) Worse

  • b) Better

  • c) Equivalent

Select a single answer