📝 Exercise M1.04
The goal of this exercise is to evaluate the impact of using an arbitrary integer encoding for categorical variables along with a linear classification model such as Logistic Regression.
To do so, let’s try to use OrdinalEncoder to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
LogisticRegression. The statistical performance of the pipeline can be
evaluated by cross-validation and then compared to the score obtained when
using OneHotEncoder or to some other baseline score.
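To make the difference between the two encodings concrete, here is a minimal sketch on a hypothetical toy column (the color values are an assumption for illustration and are not part of the adult census dataset). It shows that OrdinalEncoder maps each category to a single integer code, while OneHotEncoder creates one binary column per category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column, not taken from the adult census dataset.
toy = np.array([["red"], ["green"], ["blue"], ["green"]])

# Categories are sorted lexicographically, so the integer codes follow
# alphabetical order and carry no real meaning for the problem.
ordinal_codes = OrdinalEncoder().fit_transform(toy)

# One binary column per category; .toarray() densifies the sparse result.
one_hot_codes = OneHotEncoder().fit_transform(toy).toarray()

print(ordinal_codes.ravel())
print(one_hot_codes)
```

A linear model multiplies each encoded column by a single coefficient, so the arbitrary order of the integer codes matters in the first case but not in the second.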
First, we load the dataset.
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])
In the previous notebook, we used make_column_selector to
automatically select columns with a specific data type (also called dtype).
Here, we will use this selector to get only the columns containing strings
(columns with object dtype) that correspond to categorical features in our
dataset.
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
We filter our dataset so that it contains only categorical features.
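As a small illustration of what the selector returns, here is a sketch on a hypothetical toy frame (the rows are made up for the example). The string columns are created with an explicit object dtype so that the result does not depend on how a given pandas version infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: one numerical column and two string columns.
# dtype=object is set explicitly so the example does not rely on
# pandas' default inference for strings.
toy = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "workclass": pd.Series(["Private", "Local-gov", "Private"], dtype=object),
        "education": pd.Series(["HS-grad", "Bachelors", "Masters"], dtype=object),
    }
)

# Calling the selector on a dataframe returns a list of column names.
string_columns = selector(dtype_include=object)(toy)
toy_categorical = toy[string_columns]

print(string_columns)
```

The numerical column is left out, and indexing the frame with the returned list keeps only the string columns.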
Define a scikit-learn pipeline composed of an OrdinalEncoder and a
LogisticRegression classifier.
Because OrdinalEncoder can raise errors if it sees an unknown category at
prediction time, you can set the handle_unknown="use_encoded_value" and
unknown_value parameters. You can refer to the scikit-learn documentation
for more details regarding these parameters.
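The behavior of these two parameters can be checked in isolation. The sketch below uses hypothetical toy categories, and the choice of -1 as unknown_value is an assumption made for the example: any integer outside the range of codes already in use would do.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "Never-worked" is absent from the training set.
train = np.array([["Private"], ["Local-gov"], ["Private"]], dtype=object)
new = np.array([["Private"], ["Never-worked"]], dtype=object)

# Categories unseen during fit are encoded as unknown_value
# instead of raising an error at transform time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
codes = encoder.transform(new)

print(codes.ravel())
```

With the default handle_unknown="error", the same call to transform would raise an error because of the unseen category.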
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

# Write your code here.
Your model is now defined. Evaluate it by cross-validation using
cross_validate.
from sklearn.model_selection import cross_validate

# Write your code here.
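As a reminder of what cross_validate returns, here is a minimal sketch on synthetic data. The dataset from make_classification, the bare LogisticRegression model and the choice of cv=5 are all assumptions for illustration; this is not the solution of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic numerical data, used only to show the shape of the output.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_validate returns a dict with timing information and one test
# score per fold under the "test_score" key.
cv_results = cross_validate(LogisticRegression(), X, y, cv=5)
scores = cv_results["test_score"]

print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the mean together with the standard deviation of the fold scores makes it easier to judge whether a difference between two models is meaningful.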
Now, we would like to compare the statistical performance of our previous
model with a new model where, instead of using an OrdinalEncoder, we will
use a OneHotEncoder. Repeat the model evaluation using cross-validation.
Compare the scores of both models and conclude on the impact of choosing a
specific encoding strategy when using a linear model.
from sklearn.preprocessing import OneHotEncoder

# Write your code here.