📝 Exercise M1.03


The goal of this exercise is to compare the performance of the classifier from the previous notebook (roughly 81% accuracy with LogisticRegression) to some simple baseline classifiers. The simplest baseline classifier is one that always predicts the same class, irrespective of the input data.

What would be the score of a model that always predicts ' >50K'?
What would be the score of a model that always predicts ' <=50K'?
Is 81% or 82% accuracy a good score for this problem?
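As a rough illustration of why these questions matter, the accuracy of a constant classifier is simply the fraction of samples that belong to the predicted class. The sketch below uses a made-up list of labels (`toy_target`, 7 versus 3 samples) and a hypothetical helper `constant_accuracy`; neither comes from the adult census data, so the numbers say nothing about the actual answers to the exercise.

```python
from collections import Counter

# Hypothetical toy labels, NOT the adult census data.
toy_target = [" <=50K"] * 7 + [" >50K"] * 3


def constant_accuracy(labels, predicted_class):
    """Accuracy of a model that predicts `predicted_class` for every sample."""
    n_correct = sum(label == predicted_class for label in labels)
    return n_correct / len(labels)


# Class counts show how imbalanced the toy labels are.
counts = Counter(toy_target)
acc_low = constant_accuracy(toy_target, " <=50K")
acc_high = constant_accuracy(toy_target, " >50K")

print(counts)
print(f"always ' <=50K': {acc_low:.2f}, always ' >50K': {acc_high:.2f}")
```

On an imbalanced target, always predicting the majority class already gives a seemingly high accuracy, which is why a model's score should be judged against such a baseline.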

Use a DummyClassifier and do a train-test split to evaluate its accuracy on the test set. This link shows a few examples of how to evaluate the generalization performance of these baseline models.

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We first separate the target from the data used to train our predictive model.

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

We start by selecting only the numerical columns as seen in the previous notebook.

numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]

Split the data and target into a training set and a test set.

from sklearn.model_selection import train_test_split

# Write your code here.
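For reference, here is a minimal sketch of how `train_test_split` behaves, run on synthetic arrays rather than on the census data. The names `toy_data` and `toy_target`, as well as the choices `test_size=0.25` and `random_state=42`, are assumptions made for illustration only; the exercise does not prescribe them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numerical columns: 100 samples, 4 features.
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(100, 4))
toy_target = np.array([" <=50K"] * 75 + [" >50K"] * 25)

# Data and target are shuffled and split together, so rows stay aligned.
toy_data_train, toy_data_test, toy_target_train, toy_target_test = train_test_split(
    toy_data, toy_target, test_size=0.25, random_state=42
)

print(toy_data_train.shape, toy_data_test.shape)
```

Fixing `random_state` makes the split reproducible from one run to the next; the same call pattern should apply to `data_numeric` and `target`.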

Use a DummyClassifier such that the resulting classifier always predicts the class ' >50K'. What is the accuracy score on the test set? Repeat the experiment by always predicting the class ' <=50K'.

Hint: you can set the strategy parameter of the DummyClassifier to achieve the desired behavior.

from sklearn.dummy import DummyClassifier

# Write your code here.
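The hint can be illustrated with a small sketch on hand-built toy arrays. It assumes `strategy="constant"` together with the `constant` parameter, which is one way to force a fixed prediction. The arrays `toy_X_train`, `toy_y_train`, `toy_X_test` and `toy_y_test` are hypothetical and their class proportions are invented, so the scores below are not the answer for the census data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy split: features are ignored by DummyClassifier,
# so a column of zeros is enough.
toy_X_train = np.zeros((80, 1))
toy_y_train = np.array([" <=50K"] * 60 + [" >50K"] * 20)
toy_X_test = np.zeros((20, 1))
toy_y_test = np.array([" <=50K"] * 15 + [" >50K"] * 5)

# Classifier that always predicts ' >50K'.
always_high = DummyClassifier(strategy="constant", constant=" >50K")
always_high.fit(toy_X_train, toy_y_train)
score_high = always_high.score(toy_X_test, toy_y_test)

# Classifier that always predicts ' <=50K'.
always_low = DummyClassifier(strategy="constant", constant=" <=50K")
always_low.fit(toy_X_train, toy_y_train)
score_low = always_low.score(toy_X_test, toy_y_test)

print(f"always ' >50K': {score_high:.2f}, always ' <=50K': {score_low:.2f}")
```

The value passed to `constant` must be one of the classes seen during `fit`, including the leading space in the label. Using `strategy="most_frequent"` instead would pick the majority class of the training target automatically.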