📃 Solution for Exercise M7.02

We presented different classification metrics in the previous notebook. However, we did not use them with cross-validation. This exercise aims at practicing and implementing cross-validation.

We will reuse the blood transfusion dataset.

import pandas as pd

blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns="Class")
target = blood_transfusion["Class"]

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

First, create a decision tree classifier.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

Create a StratifiedKFold cross-validation object. Then use it inside the cross_val_score function to evaluate the decision tree. We will first use the accuracy as a score function. Explicitly use the scoring parameter of cross_val_score to compute the accuracy (even if this is the default score). Check its documentation to learn how to do that.

from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(tree, data, target, cv=cv, scoring="accuracy")
print(f"Accuracy score: {scores.mean():.3f} +/- {scores.std():.3f}")
Accuracy score: 0.623 +/- 0.149
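As a side illustration of what StratifiedKFold does, the sketch below checks that each test fold keeps the class proportions of the full target. The toy target (75 "not donated" and 25 "donated" samples) and the dummy feature matrix are assumptions made for this example; they are not the blood transfusion data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 75 "not donated" and 25 "donated" samples.
y = np.array(["not donated"] * 75 + ["donated"] * 25)
# Dummy features: only the target is used to build stratified folds.
X = np.zeros((y.size, 1))

cv_demo = StratifiedKFold(n_splits=5)
fold_proportions = []
for train_idx, test_idx in cv_demo.split(X, y):
    # Fraction of "donated" samples in this test fold.
    fold_proportions.append(float(np.mean(y[test_idx] == "donated")))

print(fold_proportions)
```

Each of the five test folds should contain one quarter of "donated" samples, matching the proportion in the full toy target.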

Repeat the experiment by computing the balanced_accuracy.

scores = cross_val_score(tree, data, target, cv=cv,
                         scoring="balanced_accuracy")
print(f"Balanced accuracy score: {scores.mean():.3f} +/- {scores.std():.3f}")
Balanced accuracy score: 0.506 +/- 0.101
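To see why the two metrics can differ so much on an imbalanced target, here is a minimal sketch on hand-made labels (an assumption for illustration, not the dataset above). It uses a model that always predicts the majority class and compares the accuracy with the balanced accuracy, which is the unweighted mean of the per-class recall.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced ground truth: 8 "not donated" and 2 "donated".
y_true = ["not donated"] * 8 + ["donated"] * 2
# A model that always predicts the majority class.
y_pred = ["not donated"] * 10

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
# Balanced accuracy is the unweighted mean of the recall of each class.
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"Accuracy: {accuracy:.2f}")
print(f"Balanced accuracy: {balanced:.2f}")
print(f"Macro-averaged recall: {macro_recall:.2f}")
```

The accuracy looks high because the majority class dominates, while the balanced accuracy stays at chance level for a binary problem.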

We will now add a bit of complexity. We would like to compute the precision of our model. However, during the course we saw that we need to specify the positive label, which in our case is the class donated.

We will show that computing the precision without providing the positive label is not supported by scikit-learn because the result would be ambiguous.

try:
    scores = cross_val_score(tree, data, target, cv=cv, scoring="precision")
except ValueError as exc:
    print(exc)
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:687: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 88, in __call__
    *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 243, in _score
    **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1659, in precision_score
    zero_division=zero_division)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1462, in precision_recall_fscore_support
    pos_label)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/sklearn/metrics/_classification.py", line 1283, in _check_set_wise_labels
    f"pos_label={pos_label} is not a valid label. It "
ValueError: pos_label=1 is not a valid label. It should be one of ['donated', 'not donated']

  UserWarning,

Tip

We catch the exception with a try/except pattern to be able to print it.

Scoring fails with a ValueError because the default scorer has its positive label set to one (pos_label=1), which does not exist in our target (our positive label is “donated”). In this case, we need to create a scorer using the scoring function and the helper function make_scorer.
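The same failure can be reproduced outside of cross-validation by calling precision_score directly. The following sketch uses four hand-made labels (an assumption for illustration) and shows that the call fails with the default pos_label but works once the positive label is given explicitly.

```python
from sklearn.metrics import precision_score

# Hypothetical string labels, mimicking the classes of the target.
y_true = ["donated", "not donated", "donated", "not donated"]
y_pred = ["donated", "donated", "not donated", "not donated"]

try:
    # The default pos_label=1 is not one of the string labels.
    precision_score(y_true, y_pred)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Stating the positive label removes the ambiguity.
precision_donated = precision_score(y_true, y_pred, pos_label="donated")
print(precision_donated)
```

Here two samples are predicted as "donated" and only one of them is correct, so the precision for the "donated" class is 0.5.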

So, import sklearn.metrics.make_scorer and sklearn.metrics.precision_score. Check their documentation for more information. Finally, create a scorer by calling make_scorer with the score function precision_score and pass the extra parameter pos_label="donated".

from sklearn.metrics import make_scorer, precision_score

precision = make_scorer(precision_score, pos_label="donated")

Now, instead of providing the string "precision" to the scoring parameter in the cross_val_score call, pass the scorer that you created above.

scores = cross_val_score(tree, data, target, cv=cv, scoring=precision)
print(f"Precision score: {scores.mean():.3f} +/- {scores.std():.3f}")
Precision score: 0.252 +/- 0.167
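For readers wondering what a scorer object does, the sketch below suggests that a scorer built with make_scorer is called as scorer(estimator, X, y) and, for a metric based on predictions, returns the same value as applying the metric to estimator.predict(X). The synthetic data from make_classification, the relabeling to "donated" / "not donated" and the max_depth value are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, relabeled with strings (illustrative only).
X, y_int = make_classification(n_samples=200, random_state=0)
y = np.where(y_int == 1, "donated", "not donated")

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scorer = make_scorer(precision_score, pos_label="donated")

# A scorer takes the fitted estimator, the data and the true target.
from_scorer = scorer(model, X, y)
# Equivalent manual computation from the predictions.
by_hand = precision_score(y, model.predict(X), pos_label="donated")

print(from_scorer, by_hand)
```

This calling convention is what lets cross_val_score apply the same scorer to each fitted model and test fold.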

cross_val_score only computes the single score provided to the scoring parameter. The function cross_validate can compute multiple scores at once by passing a list of strings or scorers to the scoring parameter, which can be handy.
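Besides a list of strings, cross_validate also accepts a dictionary mapping names to scorers, which makes it possible to mix predefined metric names with a custom scorer. The sketch below is one possible way to do this; the synthetic data, the key name precision_donated and the zero_division=0 setting are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced binary problem, relabeled with strings (illustrative only).
X, y_int = make_classification(
    n_samples=200, weights=[0.75, 0.25], random_state=0
)
y = np.where(y_int == 1, "donated", "not donated")

# Dictionary keys become the suffixes of the "test_<name>" result entries.
scoring_demo = {
    "accuracy": "accuracy",
    "balanced_accuracy": "balanced_accuracy",
    "precision_donated": make_scorer(
        precision_score, pos_label="donated", zero_division=0
    ),
}

cv_results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    cv=StratifiedKFold(n_splits=5),
    scoring=scoring_demo,
)
print(sorted(cv_results))
```

The returned dictionary holds one array per metric, with one entry per fold, alongside the fit and score times.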

Import sklearn.model_selection.cross_validate and compute the accuracy and balanced accuracy through cross-validation. Plot the cross-validation score for both metrics using a box plot.

from sklearn.model_selection import cross_validate
scoring = ["accuracy", "balanced_accuracy"]

scores = cross_validate(tree, data, target, cv=cv, scoring=scoring)
scores
{'fit_time': array([0.00376558, 0.00354266, 0.00359154, 0.00333548, 0.00360227,
        0.00415683, 0.00343227, 0.00400114, 0.00349331, 0.00343752]),
 'score_time': array([0.00347281, 0.00265384, 0.00294352, 0.00295949, 0.00264406,
        0.00289726, 0.00299859, 0.00263143, 0.00270057, 0.00291276]),
 'test_accuracy': array([0.28      , 0.50666667, 0.78666667, 0.57333333, 0.65333333,
        0.64      , 0.68      , 0.76      , 0.66216216, 0.75675676]),
 'test_balanced_accuracy': array([0.39327485, 0.46637427, 0.68859649, 0.41520468, 0.48684211,
        0.42105263, 0.54239766, 0.72807018, 0.5123839 , 0.51186791])}
import pandas as pd

color = {"whiskers": "black", "medians": "black", "caps": "black"}

metrics = pd.DataFrame(
    [scores["test_accuracy"], scores["test_balanced_accuracy"]],
    index=["Accuracy", "Balanced accuracy"]
).T
import matplotlib.pyplot as plt

metrics.plot.box(vert=False, color=color)
_ = plt.title("Computation of multiple scores using cross_validate")
[Figure: box plot of the cross-validation scores for accuracy and balanced accuracy]