📃 Solution for Exercise M1.03
The goal of this exercise is to compare the performance of our classifier in
the previous notebook (roughly 81% accuracy with LogisticRegression) to some
simple baseline classifiers. The simplest baseline classifier is one that
always predicts the same class, irrespective of the input data.
What would be the score of a model that always predicts ' >50K'?
What would be the score of a model that always predicts ' <=50K'?
Is 81% or 82% accuracy a good score for this problem?
Use a DummyClassifier and do a train-test split to evaluate its accuracy on
the test set. This link shows a few examples of how to evaluate the
generalization performance of these baseline models.
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
We first separate the target from the data used to train our predictive model.
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)
We start by selecting only the numerical columns as seen in the previous notebook.
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]
Split the data and target into a train and test set.
from sklearn.model_selection import train_test_split
# solution
data_numeric_train, data_numeric_test, target_train, target_test = (
    train_test_split(data_numeric, target, random_state=42)
)
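As a quick sanity check on what train_test_split does, here is a minimal sketch on made-up toy arrays (X_toy and y_toy are hypothetical and are not the census data). With the default settings, a quarter of the samples is held out for testing. The stratify argument is optional and is not used in the solution above; it is shown only as one way to keep class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, 75/25 class imbalance.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([" <=50K"] * 75 + [" >50K"] * 25)

# Default test_size is 0.25; stratify keeps class proportions close
# to those of the full target in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

test_low_fraction = (y_test == " <=50K").mean()
print(X_train.shape, X_test.shape)
print(f"Fraction of low-revenue samples in the test set: {test_low_fraction:.2f}")
```

Without stratify, the class proportions in the test set can drift slightly from those of the full dataset, which is one reason a test score may not match the overall class frequency exactly.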
Use a DummyClassifier such that the resulting classifier always predicts the
class ' >50K'. What is the accuracy score on the test set? Repeat the
experiment by always predicting the class ' <=50K'.
Hint: you can set the strategy parameter of the DummyClassifier to achieve
the desired behavior.
from sklearn.dummy import DummyClassifier
# solution
class_to_predict = " >50K"
high_revenue_clf = DummyClassifier(
    strategy="constant", constant=class_to_predict
)
high_revenue_clf.fit(data_numeric_train, target_train)
score = high_revenue_clf.score(data_numeric_test, target_test)
print(f"Accuracy of a model predicting only high revenue: {score:.3f}")
Accuracy of a model predicting only high revenue: 0.234
We clearly see that the score is below 0.5, which might be surprising at first.
We now check the generalization performance of a model that always predicts
the low-revenue class, i.e. " <=50K".
class_to_predict = " <=50K"
low_revenue_clf = DummyClassifier(
    strategy="constant", constant=class_to_predict
)
low_revenue_clf.fit(data_numeric_train, target_train)
score = low_revenue_clf.score(data_numeric_test, target_test)
print(f"Accuracy of a model predicting only low revenue: {score:.3f}")
Accuracy of a model predicting only low revenue: 0.766
We observe that this model has an accuracy higher than 0.5. This is because roughly 3/4 of the samples belong to the low-revenue class.
Therefore, any predictive model that scores below this dummy classifier would not be helpful.
adult_census["class"].value_counts()
class
 <=50K    37155
 >50K     11687
Name: count, dtype: int64
(target == " <=50K").mean()
np.float64(0.7607182343065395)
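The link between class proportions and the accuracy of a constant prediction can also be checked by hand. The sketch below uses a small made-up list of labels (not the census data), and constant_accuracy is a hypothetical helper written for illustration only.

```python
from collections import Counter

# Hypothetical toy labels, not taken from the adult census dataset.
labels = [" <=50K"] * 6 + [" >50K"] * 2


def constant_accuracy(y_true, constant):
    """Accuracy of a model that predicts `constant` for every sample."""
    return sum(label == constant for label in y_true) / len(y_true)


# Proportion of each class among the labels.
proportions = {
    cls: count / len(labels) for cls, count in Counter(labels).items()
}

acc_low = constant_accuracy(labels, " <=50K")
acc_high = constant_accuracy(labels, " >50K")
print(acc_low, acc_high)  # 0.75 0.25
```

The accuracy of each constant prediction equals the proportion of that class among the evaluated labels, so the two accuracies sum to 1 in a binary problem.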
In practice, we could use the strategy "most_frequent" to predict the class
that appears the most in the training target.
most_freq_revenue_clf = DummyClassifier(strategy="most_frequent")
most_freq_revenue_clf.fit(data_numeric_train, target_train)
score = most_freq_revenue_clf.score(data_numeric_test, target_test)
print(f"Accuracy of a model predicting the most frequent class: {score:.3f}")
Accuracy of a model predicting the most frequent class: 0.766
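To illustrate why both scores match, here is a minimal sketch on made-up toy data (X_toy and y_toy are hypothetical) showing that "most_frequent" gives the same predictions as "constant" when the constant is the majority class of the training target.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data; DummyClassifier ignores the feature values.
X_toy = np.zeros((8, 1))
y_toy = np.array([" <=50K"] * 6 + [" >50K"] * 2)

most_freq = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
constant = DummyClassifier(strategy="constant", constant=" <=50K").fit(
    X_toy, y_toy
)

preds_most_freq = most_freq.predict(X_toy)
preds_constant = constant.predict(X_toy)

# Both classifiers output the majority class for every sample.
same_predictions = bool((preds_most_freq == preds_constant).all())
toy_score = most_freq.score(X_toy, y_toy)
print(same_predictions, toy_score)
```

The advantage of "most_frequent" is that we do not need to know in advance which class is the majority one.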
So the LogisticRegression accuracy (roughly 81%) seems better than the
DummyClassifier accuracy (roughly 76%). In a way, it is a bit reassuring:
using a machine learning model gives better performance than always
predicting the majority class, i.e. the low-revenue class " <=50K".
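A single train-test split gives only one estimate of each score. As a hedged sketch of a possible alternative, the comparison can be repeated with cross-validation. The data below is synthetic (X_toy and y_toy are hypothetical, generated so that roughly a quarter of the samples fall in the " >50K" class), so the scores do not correspond to the census results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: one feature, and a target that depends on it.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 1))
y_toy = np.where(X_toy[:, 0] > 0.7, " >50K", " <=50K")

# 5-fold cross-validated accuracy for the baseline and for the model.
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_toy, y_toy, cv=5
)
logreg_scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5)

print(f"Dummy accuracy: {dummy_scores.mean():.3f} ± {dummy_scores.std():.3f}")
print(f"LogisticRegression accuracy: {logreg_scores.mean():.3f} "
      f"± {logreg_scores.std():.3f}")
```

Comparing the mean scores together with their spread across folds helps judge whether the gap between the model and the baseline is larger than the variation caused by the choice of split.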