📃 Solution for Exercise M1.03

The goal of this exercise is to compare the performance of our classifier in the previous notebook (roughly 81% accuracy with LogisticRegression) to some simple baseline classifiers. The simplest baseline classifier is one that always predicts the same class, irrespective of the input data.

What would be the score of a model that always predicts ' >50K'?
What would be the score of a model that always predicts ' <=50K'?
Is 81% or 82% accuracy a good score for this problem?

Use a DummyClassifier and do a train-test split to evaluate its accuracy on the test set. This link shows a few examples of how to evaluate the generalization performance of these baseline models.

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We first split our dataset to have the target separated from the data used to train our predictive model.

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

We start by selecting only the numerical columns as seen in the previous notebook.

numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]

Split the data and target into a train and test set.

from sklearn.model_selection import train_test_split

# solution
data_numeric_train, data_numeric_test, target_train, target_test = (
    train_test_split(data_numeric, target, random_state=42)
)

Use a DummyClassifier such that the resulting classifier always predicts the class ' >50K'. What is the accuracy score on the test set? Repeat the experiment by always predicting the class ' <=50K'.

Hint: you can set the strategy parameter of the DummyClassifier to achieve the desired behavior.

from sklearn.dummy import DummyClassifier

# solution
class_to_predict = " >50K"
high_revenue_clf = DummyClassifier(
    strategy="constant", constant=class_to_predict
)
high_revenue_clf.fit(data_numeric_train, target_train)
score = high_revenue_clf.score(data_numeric_test, target_test)
print(f"Accuracy of a model predicting only high revenue: {score:.3f}")

Accuracy of a model predicting only high revenue: 0.234

The score is clearly below 0.5, which might be surprising at first. We now check the generalization performance of a model that always predicts the low revenue class, i.e. " <=50K".

class_to_predict = " <=50K"
low_revenue_clf = DummyClassifier(
    strategy="constant", constant=class_to_predict
)
low_revenue_clf.fit(data_numeric_train, target_train)
score = low_revenue_clf.score(data_numeric_test, target_test)
print(f"Accuracy of a model predicting only low revenue: {score:.3f}")

Accuracy of a model predicting only low revenue: 0.766

We observe that this model has an accuracy higher than 0.5. This is because roughly 3/4 of the target samples belong to the low-revenue class.

Therefore, any predictive model giving results below this dummy classifier would not be helpful.

adult_census["class"].value_counts()

class
<=50K    37155
>50K     11687
Name: count, dtype: int64

(target == " <=50K").mean()

np.float64(0.7607182343065395)
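The link between class proportions and the accuracy of a constant classifier can be checked by hand. The sketch below uses a small made-up list of labels (not the census data) and a hypothetical helper named constant_accuracy. It only illustrates that always predicting one class scores exactly the fraction of samples belonging to that class, so the two constant scores add up to 1.

```python
# Toy labels, invented for illustration: 3 low-revenue for every high-revenue.
toy_target = [" <=50K", " <=50K", " <=50K", " >50K"] * 5


def constant_accuracy(labels, predicted_class):
    """Accuracy obtained when predicting `predicted_class` for every sample."""
    n_correct = sum(label == predicted_class for label in labels)
    return n_correct / len(labels)


low_score = constant_accuracy(toy_target, " <=50K")
high_score = constant_accuracy(toy_target, " >50K")

print(f"Always ' <=50K': {low_score:.2f}")
print(f"Always ' >50K': {high_score:.2f}")
print(f"Sum of both scores: {low_score + high_score:.2f}")
```

This is consistent with the two test scores reported above: 0.234 + 0.766 = 1.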

In practice, we could use the strategy "most_frequent" to predict the class that appears most often in the training target.

most_freq_revenue_clf = DummyClassifier(strategy="most_frequent")
most_freq_revenue_clf.fit(data_numeric_train, target_train)
score = most_freq_revenue_clf.score(data_numeric_test, target_test)
print(f"Accuracy of a model predicting the most frequent class: {score:.3f}")

Accuracy of a model predicting the most frequent class: 0.766
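As a rough sketch of what the "most_frequent" strategy amounts to, the snippet below mimics it in plain Python on made-up labels. The variable names are hypothetical and this is not how scikit-learn implements it; the point is only that the majority class is learned from the training target and then predicted for every test sample, whatever the input features are.

```python
from collections import Counter

# Made-up targets for illustration only (not the census data).
toy_target_train = [" <=50K"] * 6 + [" >50K"] * 2
toy_target_test = [" <=50K", " >50K", " <=50K", " <=50K"]

# "Fitting" only looks at the training target: keep the most common class.
majority_class = Counter(toy_target_train).most_common(1)[0][0]

# "Predicting" ignores the input features and repeats that class.
predictions = [majority_class] * len(toy_target_test)

n_correct = sum(
    pred == true for pred, true in zip(predictions, toy_target_test)
)
accuracy = n_correct / len(toy_target_test)

print(f"Majority class: {majority_class!r}")
print(f"Accuracy on the toy test set: {accuracy:.2f}")
```

Because the majority class is chosen on the training target, the resulting test accuracy matches the constant " <=50K" classifier whenever that class dominates the training set, as it does here.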

So the LogisticRegression accuracy (roughly 81%) seems better than the DummyClassifier accuracy (roughly 76%). This is somewhat reassuring: using a machine learning model gives better performance than always predicting the majority class, i.e. the low income class " <=50K".
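One possible way to put this gap in perspective is to express it relative to the errors made by the baseline. The short computation below reuses the rounded scores quoted in this notebook (0.81 and 0.766); the variable names are illustrative, and the result should be read as a rough order of magnitude rather than a precise figure.

```python
# Approximate scores quoted in this notebook.
model_accuracy = 0.81      # LogisticRegression, rounded
baseline_accuracy = 0.766  # DummyClassifier predicting the most frequent class

# Gain in accuracy over the baseline, in absolute terms.
absolute_gain = model_accuracy - baseline_accuracy

# Share of the baseline's errors that the model removes.
relative_error_reduction = absolute_gain / (1 - baseline_accuracy)

print(f"Absolute gain over the baseline: {absolute_gain:.3f}")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

Under these rounded figures, the model gains a few points of accuracy and removes a bit less than a fifth of the baseline's errors.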