Wrap-up quiz 2
This quiz requires some programming to answer.
Open the dataset blood_transfusion.csv
with the following commands:
import pandas as pd
blood_transfusion = pd.read_csv("../datasets/blood_transfusion.csv")
target_name = "Class"
data = blood_transfusion.drop(columns=target_name)
target = blood_transfusion[target_name]
blood_transfusion is a pandas dataframe. The column "Class" contains the
target variable.
Question
Select the correct answers from the following proposals.
a) The problem to be solved is a regression problem
b) The problem to be solved is a binary classification problem (exactly 2 possible classes)
c) The problem to be solved is a multiclass classification problem (more than 2 possible classes)
d) The proportions of the class counts are imbalanced: some classes have more than twice as many rows as others
Select all answers that apply
Hint: target.unique() and target.value_counts() are methods that are helpful
to answer this question.
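A minimal sketch to inspect the class distribution, assuming the loading snippet above has been run (the normalize=True option is an extra convenience, not required by the question):
# List the distinct classes present in the target
print(target.unique())
# Count the number of rows per class
print(target.value_counts())
# Optional: class proportions instead of raw counts
print(target.value_counts(normalize=True))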
Question
Using a sklearn.dummy.DummyClassifier and the strategy "most_frequent", what
is the average of the accuracy scores obtained by performing a 10-fold
cross-validation?
a) ~25%
b) ~50%
c) ~75%
Select a single answer
Hint: You can check the documentation of sklearn.model_selection.cross_val_score
and sklearn.model_selection.cross_validate.
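One possible way to set up this experiment (a sketch, not the only valid approach):
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Baseline classifier that always predicts the most frequent class
dummy = DummyClassifier(strategy="most_frequent")
# 10-fold cross-validation with the default accuracy score
scores = cross_val_score(dummy, data, target, cv=10)
print(scores.mean())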
Question
Repeat the previous experiment but compute the balanced accuracy instead of
the accuracy score. Pass scoring="balanced_accuracy" when calling the
cross_validate or cross_val_score functions. The mean score is:
a) ~25%
b) ~50%
c) ~75%
Select a single answer
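A minimal variation of the previous sketch, where only the scoring parameter changes:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy = DummyClassifier(strategy="most_frequent")
# Same 10-fold cross-validation, but scored with balanced accuracy
scores = cross_val_score(dummy, data, target, cv=10, scoring="balanced_accuracy")
print(scores.mean())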
We will use a
sklearn.neighbors.KNeighborsClassifier
for the remainder of this quiz.
Question
Why is it relevant to add a preprocessing step to scale the data using a
StandardScaler when working with a KNeighborsClassifier?
a) faster to compute the list of neighbors on scaled data
b) k-nearest neighbors is based on computing some distances. Features need to be normalized to contribute approximately equally to the distance computation.
c) This is irrelevant. One could use k-nearest neighbors without normalizing the dataset and get a very similar cross-validation score.
Select a single answer
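One way to check this claim empirically is to compare the cross-validated scores with and without scaling. The following is only a sketch, assuming the data and target defined above:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# k-nearest neighbors on the raw, unscaled features
raw_scores = cross_val_score(
    KNeighborsClassifier(), data, target, cv=10, scoring="balanced_accuracy"
)
# k-nearest neighbors preceded by standardization of the features
scaled_scores = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    data, target, cv=10, scoring="balanced_accuracy",
)
print(raw_scores.mean(), scaled_scores.mean())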
Create a scikit-learn pipeline (using sklearn.pipeline.make_pipeline) where a
StandardScaler is used to scale the data, followed by a KNeighborsClassifier.
Use the default hyperparameters.
Question
Inspect the parameters of the created pipeline. What is the value of K, the number of neighbors considered when predicting with the k-nearest neighbors classifier?
a) 1
b) 3
c) 5
d) 8
e) 10
Select a single answer
Hint: You can use model.get_params()
to get the parameters of a scikit-learn
estimator.
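A sketch of the pipeline creation and parameter inspection; note that the step prefix in the parameter names is generated automatically by make_pipeline from the lowercased class name:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then classify with k-nearest neighbors (default hyperparameters)
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
# get_params() exposes all nested parameters of the pipeline
print(model.get_params())
# The parameter of interest uses the auto-generated step prefix
print(model.get_params()["kneighborsclassifier__n_neighbors"])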
Question
Set n_neighbors=1
in the previous model and evaluate it using a 10-fold
cross-validation. Use the balanced accuracy as a score. What can you say about
this model? Compare the average of the train and test scores to support your
answer.
a) The model clearly underfits
b) The model generalizes
c) The model clearly overfits
Select a single answer
Hint: compute the average test score and the average train score and compare
them. Make sure to pass return_train_score=True
to the cross_validate
function to also compute the train score.
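One way to run this evaluation (a sketch; here the pipeline is simply rebuilt with n_neighbors=1, but set_params would work as well):
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
# 10-fold cross-validation with balanced accuracy, keeping the train scores
cv_results = cross_validate(
    model, data, target, cv=10, scoring="balanced_accuracy",
    return_train_score=True,
)
print(cv_results["train_score"].mean())
print(cv_results["test_score"].mean())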
We now study the effect of the parameter n_neighbors
on the train and test
scores using a validation curve. You can use the following parameter range:
import numpy as np
param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])
Also, use a 5-fold cross-validation and compute the balanced accuracy score
instead of the default accuracy score (check the scoring
parameter). Finally,
plot the average train and test scores for the different values of the
hyperparameter. We recall that the name of the parameter can be found using
model.get_params().
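A sketch of this study, assuming the pipeline built above; one possible approach uses sklearn.model_selection.validation_curve together with matplotlib for the plot:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])
# Train and test scores for each value of n_neighbors, 5-fold CV, balanced accuracy
train_scores, test_scores = validation_curve(
    model, data, target,
    param_name="kneighborsclassifier__n_neighbors",
    param_range=param_range,
    cv=5, scoring="balanced_accuracy",
)
plt.plot(param_range, train_scores.mean(axis=1), label="Train score")
plt.plot(param_range, test_scores.mean(axis=1), label="Test score")
plt.xscale("log")
plt.xlabel("n_neighbors")
plt.ylabel("Balanced accuracy")
plt.legend()
plt.show()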
Question
Select the true statement among the following:
a) The model underfits for a range of n_neighbors values between 1 and 10
b) The model underfits for a range of n_neighbors values between 10 and 100
c) The model underfits for a range of n_neighbors values between 100 and 500
Select a single answer
Question
Select the most correct of the statements below:
a) The model overfits for a range of n_neighbors values between 1 and 10
b) The model overfits for a range of n_neighbors values between 10 and 100
c) The model overfits for a range of n_neighbors values between 100 and 500
Select a single answer
Question
Select the most correct of the statements below:
a) The model best generalizes for a range of n_neighbors values between 1 and 10
b) The model best generalizes for a range of n_neighbors values between 10 and 100
c) The model best generalizes for a range of n_neighbors values between 100 and 500
Select a single answer