📝 Exercise M1.03
The goal of this exercise is to compare the performance of our classifier in the previous notebook (roughly 81% accuracy with LogisticRegression) to some simple baseline classifiers. The simplest baseline classifier is one that always predicts the same class, irrespective of the input data.

What would be the score of a model that always predicts ' >50K'?
What would be the score of a model that always predicts ' <=50K'?
Is 81% or 82% accuracy a good score for this problem?
Use a DummyClassifier and do a train-test split to evaluate its accuracy on the test set. The scikit-learn documentation on dummy estimators shows a few examples of how to evaluate the generalization performance of these baseline models.
import pandas as pd

# Load the adult census dataset.
adult_census = pd.read_csv("../datasets/adult-census.csv")
We first separate the target from the data used to train our predictive model.
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)
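Optionally, you can inspect the class distribution of the target: the accuracy of a classifier that always predicts a single class is simply the proportion of that class in the data. A minimal sketch, assuming the target takes the two values ' <=50K' and ' >50K' as in the previous notebook:

```python
# Proportion of each class in the target; this is the accuracy a constant
# predictor of that class would obtain on data with the same distribution.
target.value_counts(normalize=True)
```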
We start by selecting only the numerical columns as seen in the previous notebook.
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]
Split the data and target into a train and test set.
from sklearn.model_selection import train_test_split
# Write your code here.
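One possible sketch of the split; the specific random_state is an arbitrary choice, used only to make the split reproducible, and the default 25% test size is kept:

```python
from sklearn.model_selection import train_test_split

# Hold out part of the data as a test set (25% by default).
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42
)
```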
Use a DummyClassifier such that the resulting classifier always predicts the class ' >50K'. What is the accuracy score on the test set? Repeat the experiment by always predicting the class ' <=50K'.

Hint: you can set the strategy parameter of the DummyClassifier to achieve the desired behavior.
from sklearn.dummy import DummyClassifier
# Write your code here.
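A minimal sketch of one way to do it, assuming the train/test split above and that the class labels keep the leading space found in the raw dataset:

```python
from sklearn.dummy import DummyClassifier

# Baseline that always predicts the high-revenue class.
high_revenue_clf = DummyClassifier(strategy="constant", constant=" >50K")
high_revenue_clf.fit(data_train, target_train)
score_high = high_revenue_clf.score(data_test, target_test)
print(f"Accuracy when always predicting ' >50K': {score_high:.3f}")

# Baseline that always predicts the low-revenue class.
low_revenue_clf = DummyClassifier(strategy="constant", constant=" <=50K")
low_revenue_clf.fit(data_train, target_train)
score_low = low_revenue_clf.score(data_test, target_test)
print(f"Accuracy when always predicting ' <=50K': {score_low:.3f}")
```

If the class proportions found earlier are strongly imbalanced in favor of ' <=50K', the second baseline should already reach an accuracy close to that proportion, which helps put the roughly 81% of LogisticRegression in perspective.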