# Encoding of categorical variables

In this notebook, we will present typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.

Let’s first load the entire adult dataset containing both numerical and categorical data.

import pandas as pd

# the file path below is an assumption, matching the dataset layout used
# in the previous notebooks of this course
adult_census = pd.read_csv("../datasets/adult-census.csv")

# drop the duplicated column "education-num" as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

## Identify categorical variables

As we saw in the previous section, a numerical variable is a quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.

In contrast, categorical variables have discrete values, typically represented by string labels (but not only) taken from a finite list of possible choices. For instance, the variable native-country in our dataset is a categorical variable because it encodes the data using a finite list of possible countries (along with the ? symbol when this information is missing):

data["native-country"].value_counts().sort_index()

 ?                               857
Cambodia                         28
China                           122
Columbia                         85
Cuba                            138
Dominican-Republic              103
England                         127
France                           38
Germany                         206
Greece                           49
Guatemala                        88
Haiti                            75
Holand-Netherlands                1
Honduras                         20
Hong                             30
Hungary                          19
India                           151
Iran                             59
Ireland                          37
Italy                           105
Jamaica                         106
Japan                            92
Laos                             23
Mexico                          951
Nicaragua                        49
Outlying-US(Guam-USVI-etc)       23
Peru                             46
Philippines                     295
Poland                           87
Portugal                         67
Puerto-Rico                     184
Scotland                         21
South                           115
Taiwan                           65
Thailand                         30
United-States                 43832
Vietnam                          86
Yugoslavia                       23
Name: native-country, dtype: int64


How can we easily recognize categorical columns among the dataset? Part of the answer lies in the columns’ data type:

data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object


If we look at the "native-country" column, we observe its data type is object, meaning it contains string values.

## Select features based on their data type

In the previous notebook, we manually defined the numerical columns. We could take a similar approach here. Instead, we will use the scikit-learn helper function make_column_selector, which allows us to select columns based on their data type. We will illustrate how to use this helper.

from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns

['workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country']


Here, we created the selector by passing the data type to include; we then passed the input dataset to the selector object, which returned a list of column names that have the requested data type.
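
As a quick aside, the same helper can build the complementary selector by excluding the object data type; this sketch is not used in the rest of this notebook:

# reusing the `selector` helper and `data` defined above
numerical_columns_selector = selector(dtype_exclude=object)
numerical_columns_selector(data)

We can now filter out the unwanted columns: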

data_categorical = data[categorical_columns]
data_categorical.head()

workclass education marital-status occupation relationship race sex native-country
0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States
1 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States
2 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States
3 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States
4 ? Some-college Never-married ? Own-child White Female United-States
print(f"The dataset is composed of {data_categorical.shape[1]} features")

The dataset is composed of 8 features


In the remainder of this section, we will present different strategies to encode categorical data into numerical data that can be used by a machine learning algorithm.

## Strategies to encode categories

### Encoding ordinal categories

The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder transforms the data in such a manner. We will start by encoding a single column to understand how the encoding works.

from sklearn.preprocessing import OrdinalEncoder

education_column = data_categorical[["education"]]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded

array([[ 1.],
[11.],
[ 7.],
...,
[11.],
[11.],
[11.]])


We see that each category in "education" has been replaced by a numeric value. We can check the mapping between the categories and the numerical values by inspecting the fitted attribute categories_.

encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
' HS-grad', ' Masters', ' Preschool', ' Prof-school',
' Some-college'], dtype=object)]


Now, we can check the encoding applied on all categorical features.

data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
[ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
[ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
[ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
[ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])

print(f"The encoded dataset contains {data_encoded.shape[1]} features")

The encoded dataset contains 8 features


We see that the categories have been encoded for each feature (column) independently. We also note that the number of features before and after the encoding is the same.

However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3… for instance).

By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named "size" with categories such as “S”, “M”, “L”, “XL”. We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3, by following the alphabetical order.

The OrdinalEncoder class accepts a categories constructor argument to pass categories in the expected ordering explicitly. You can find more information in the scikit-learn documentation if needed.
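
As a small illustration, using a toy "size" column that is not part of our dataset, we can compare the default lexicographical ordering with an explicitly provided one:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy "size" column, for illustration only
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})

# default lexicographical ordering: L=0, M=1, S=2, XL=3
print(OrdinalEncoder().fit_transform(sizes).ravel())  # [2. 1. 0. 3. 1.]

# explicit ordering matching the meaning of the sizes: S=0, M=1, L=2, XL=3
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
print(size_encoder.fit_transform(sizes).ravel())  # [0. 1. 2. 3. 1.]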

If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).

### Encoding nominal categories (without assuming any order)

OneHotEncoder is an alternative encoder that prevents downstream models from making false assumptions about the ordering of categories. For a given feature, it creates as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category is set to 1 while the columns of all other categories are set to 0.

We will start by encoding a single feature (e.g. "education") to illustrate how the encoding works.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
education_encoded = encoder.fit_transform(education_column)
education_encoded


array([[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])


Note

sparse_output=False is used in the OneHotEncoder for didactic purposes, namely easier visualization of the data.

Sparse matrices are efficient data structures when most of your matrix elements are zero. They won’t be covered in detail in this course. If you want more details about them, you can look at the scipy.sparse documentation.
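
As a side note, the default sparse output can be inspected directly; this small sketch reuses the education_column defined above:

from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# with the default settings, the result is a SciPy sparse matrix that
# only stores the non-zero entries
sparse_encoded = OneHotEncoder().fit_transform(education_column)
print(sparse.issparse(sparse_encoded))   # True
print(sparse_encoded[:3].toarray())      # densify a few rows for display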

We see that encoding a single feature will give a NumPy array full of zeros and ones. We can get a better understanding using the associated feature names resulting from the transformation.

feature_names = encoder.get_feature_names_out(input_features=["education"])
education_encoded = pd.DataFrame(education_encoded, columns=feature_names)
education_encoded

education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th education_ 7th-8th education_ 9th education_ Assoc-acdm education_ Assoc-voc education_ Bachelors education_ Doctorate education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48837 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
48838 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48839 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48840 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48841 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

48842 rows × 16 columns

As we can see, each category (unique value) became a column; the encoding returned, for each sample, a 1 to specify which category it belongs to.

Let’s apply this encoding on the full dataset.

print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()

The dataset is composed of 8 features

workclass education marital-status occupation relationship race sex native-country
0 Private 11th Never-married Machine-op-inspct Own-child Black Male United-States
1 Private HS-grad Married-civ-spouse Farming-fishing Husband White Male United-States
2 Local-gov Assoc-acdm Married-civ-spouse Protective-serv Husband White Male United-States
3 Private Some-college Married-civ-spouse Machine-op-inspct Husband Black Male United-States
4 ? Some-college Never-married ? Own-child White Female United-States

data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]


array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0.]])

print(f"The encoded dataset contains {data_encoded.shape[1]} features")

The encoded dataset contains 102 features


Let’s wrap this NumPy array in a dataframe with informative column names as provided by the encoder object:

columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded[:5], columns=columns_encoded)

workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay education_ 10th ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 102 columns

Look at how the "workclass" variable of the first 3 records has been encoded and compare this to the original string representation.

The number of features after the encoding is more than 10 times larger than in the original data because some variables such as occupation and native-country have many possible categories.
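
We can double-check this: each categorical column contributes one encoded column per category, so summing the number of unique values over the columns should match the 102 features reported above:

# per-column category counts; their sum gives the width of the encoding
data_categorical.nunique().sum()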

### Choosing an encoding strategy

Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).

Note

In general, OneHotEncoder is the encoding strategy used when the downstream models are linear models, while OrdinalEncoder is often a good strategy with tree-based models.

Using an OrdinalEncoder outputs ordinal categories. This means that there is an order in the resulting categories (e.g. 0 < 1 < 2). The impact of violating this ordering assumption depends strongly on the downstream models: linear models will be impacted by misordered categories while tree-based models will not.

You can still use an OrdinalEncoder with linear models but you need to be sure that:

• the original categories (before encoding) have an ordering;

• the encoded categories follow the same ordering as the original categories.

The next exercise shows what can happen when using an OrdinalEncoder with a linear model and the conditions above are not met.

One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use OneHotEncoder in such cases even if the original categories do not have a given order. We will show this in the final exercise of this sequence.
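
As a hedged sketch of the tree-friendly alternative (the specific model below is our choice for illustration; tree-based models are covered later in the course), such a pipeline could pair an OrdinalEncoder with a gradient-boosting classifier:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# trees only threshold on the encoded values, so the arbitrary order of
# the integer codes does not mislead them the way it can mislead a
# linear model
tree_model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    HistGradientBoostingClassifier(),
)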

## Evaluate our predictive pipeline

We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let’s train a linear classifier on the encoded data and check the generalization performance of this machine learning pipeline using cross-validation.

Before we create the pipeline, we need to take a closer look at the native-country column. Let’s recall some statistics regarding it.

data["native-country"].value_counts()

 United-States                 43832
Mexico                          951
?                               857
Philippines                     295
Germany                         206
Puerto-Rico                     184
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru                             46
France                           38
Ireland                          37
Hong                             30
Thailand                         30
Cambodia                         28
Laos                             23
Yugoslavia                       23
Outlying-US(Guam-USVI-etc)       23
Scotland                         21
Honduras                         20
Hungary                          19
Holand-Netherlands                1
Name: native-country, dtype: int64


We see that the Holand-Netherlands category occurs rarely (a single sample in the entire dataset). This will be a problem during cross-validation: if that sample ends up in the test set during splitting, then the classifier would not have seen the category during training and would not be able to encode it.

In scikit-learn, there are two solutions to bypass this issue:

• list all the possible categories and provide them to the encoder via the keyword argument categories;

• use the parameter handle_unknown, i.e. if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.

Here, we will use the latter solution for simplicity.
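
To make this option concrete, here is a minimal sketch with made-up categories:

from sklearn.preprocessing import OneHotEncoder

demo_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
demo_encoder.fit([["cat"], ["dog"]])  # toy training data with two categories

# a category never seen during fit is encoded as a row of zeros
print(demo_encoder.transform([["snake"]]))  # [[0. 0.]]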

Tip

Be aware that OrdinalEncoder also exposes a handle_unknown parameter. It can be set to "use_encoded_value"; if that option is chosen, you can define a fixed value to which all unknown categories are mapped during transform. For example, OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=42) will map to 42 any value encountered during transform that was not part of the data seen during the fit call. You are going to use these parameters in the next exercise.
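
A minimal sketch of that behaviour, again with made-up categories:

from sklearn.preprocessing import OrdinalEncoder

demo_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=42)
demo_encoder.fit([["cat"], ["dog"]])  # toy training data

# a category never seen during fit is mapped to the fixed value 42
print(demo_encoder.transform([["snake"]]))  # [[42.]]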

We can now create our machine learning pipeline.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)


Note

Here, we need to increase the maximum number of iterations to obtain a fully converged LogisticRegression and silence a ConvergenceWarning. Contrary to the numerical features, the one-hot encoded categorical features are all on the same scale (values are 0 or 1), so they would not benefit from scaling. In this case, increasing max_iter is the right thing to do.

Finally, we can check the model’s generalization performance using only the categorical columns.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data_categorical, target)
cv_results

{'fit_time': array([0.84990811, 0.79812551, 0.88143182, 0.78920507, 0.7831552 ]),
'score_time': array([0.03406978, 0.03360724, 0.03421521, 0.03650665, 0.03367352]),
'test_score': array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])}

scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

The accuracy is: 0.833 ± 0.002


As you can see, this representation of the categorical variables is slightly more predictive of the revenue than the numerical variables that we used previously.

In this notebook we have:

• seen two common strategies for encoding categorical features: ordinal encoding and one-hot encoding;

• used a pipeline to chain a one-hot encoder with a logistic regression model.