# Preprocessing for numerical featuresΒΆ

In this notebook, we will still use only numerical features.

We will introduce these new aspects:

an example of preprocessing, namely

**scaling numerical variables**;using a scikit-learn

**pipeline**to chain preprocessing and model training;assessing the generalization performance of our model via

**cross-validation**instead of a single train-test split.

## Data preparationΒΆ

First, letβs load the full adult census dataset.

```
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
```

```
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')
```

We will now drop the target from the data we will use to train our predictive model.

```
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)
```

Then, we select only the numerical columns, as seen in the previous notebook.

```
numerical_columns = [
"age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]
```

Finally, we can divide our dataset into a train and test sets.

```
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
data_numeric, target, random_state=42)
```

## Model fitting with preprocessingΒΆ

A range of preprocessing algorithms in scikit-learn allow us to transform the input data before training a model. In our case, we will standardize the data and then train a new logistic regression model on that new version of the dataset.

Letβs start by printing some statistics about the training data.

```
data_train.describe()
```

age | capital-gain | capital-loss | hours-per-week | |
---|---|---|---|---|

count | 36631.000000 | 36631.000000 | 36631.000000 | 36631.000000 |

mean | 38.642352 | 1087.077721 | 89.665311 | 40.431247 |

std | 13.725748 | 7522.692939 | 407.110175 | 12.423952 |

min | 17.000000 | 0.000000 | 0.000000 | 1.000000 |

25% | 28.000000 | 0.000000 | 0.000000 | 40.000000 |

50% | 37.000000 | 0.000000 | 0.000000 | 40.000000 |

75% | 48.000000 | 0.000000 | 0.000000 | 45.000000 |

max | 90.000000 | 99999.000000 | 4356.000000 | 99.000000 |

We see that the datasetβs features span across different ranges. Some algorithms make some assumptions regarding the feature distributions and usually normalizing features will be helpful to address these assumptions.

Tip

Here are some reasons for scaling features:

Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.

Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled.

Whether or not a machine learning model requires scaling the features depends on the model family. Linear models such as logistic regression generally benefit from scaling the features while other models such as decision trees do not need such preprocessing (but will not suffer from it).

We show how to apply such normalization using a scikit-learn transformer
called `StandardScaler`

. This transformer shifts and scales each feature
individually so that they all have a 0-mean and a unit standard deviation.

We will investigate different steps used in scikit-learn to achieve such a transformation of the data.

First, one needs to call the method `fit`

in order to learn the scaling from
the data.

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data_train)
```

StandardScaler()

The `fit`

method for transformers is similar to the `fit`

method for
predictors. The main difference is that the former has a single argument (the
data matrix), whereas the latter has two arguments (the data matrix and the
target).

In this case, the algorithm needs to compute the mean and standard deviation for each feature and store them into some NumPy arrays. Here, these statistics are the model states.

Note

The fact that the model states of this scaler are arrays of means and
standard deviations is specific to the `StandardScaler`

. Other
scikit-learn transformers will compute different statistics and store them
as model states, in the same fashion.

We can inspect the computed means and standard deviations.

```
scaler.mean_
```

```
array([ 38.64235211, 1087.07772106, 89.6653108 , 40.43124676])
```

```
scaler.scale_
```

```
array([ 13.72556083, 7522.59025606, 407.10461772, 12.42378265])
```

Note

scikit-learn convention: if an attribute is learned from the data, its name
ends with an underscore (i.e. `_`

), as in `mean_`

and `scale_`

for the
`StandardScaler`

.

Scaling the data is applied to each feature individually (i.e. each column in the data matrix). For each feature, we subtract its mean and divide by its standard deviation.

Once we have called the `fit`

method, we can perform data transformation by
calling the method `transform`

.

```
data_train_scaled = scaler.transform(data_train)
data_train_scaled
```

```
array([[ 0.17177061, -0.14450843, 5.71188483, -2.28845333],
[ 0.02605707, -0.14450843, -0.22025127, -0.27618374],
[-0.33822677, -0.14450843, -0.22025127, 0.77019645],
...,
[-0.77536738, -0.14450843, -0.22025127, -0.03471139],
[ 0.53605445, -0.14450843, -0.22025127, -0.03471139],
[ 1.48319243, -0.14450843, -0.22025127, -2.69090725]])
```

Letβs illustrate the internal mechanism of the `transform`

method and put it
to perspective with what we already saw with predictors.

The `transform`

method for transformers is similar to the `predict`

method
for predictors. It uses a predefined function, called a **transformation
function**, and uses the model states and the input data. However, instead of
outputting predictions, the job of the `transform`

method is to output a
transformed version of the input data.

Finally, the method `fit_transform`

is a shorthand method to call
successively `fit`

and then `transform`

.

```
data_train_scaled = scaler.fit_transform(data_train)
data_train_scaled
```

```
array([[ 0.17177061, -0.14450843, 5.71188483, -2.28845333],
[ 0.02605707, -0.14450843, -0.22025127, -0.27618374],
[-0.33822677, -0.14450843, -0.22025127, 0.77019645],
...,
[-0.77536738, -0.14450843, -0.22025127, -0.03471139],
[ 0.53605445, -0.14450843, -0.22025127, -0.03471139],
[ 1.48319243, -0.14450843, -0.22025127, -2.69090725]])
```

```
data_train_scaled = pd.DataFrame(data_train_scaled,
columns=data_train.columns)
data_train_scaled.describe()
```

age | capital-gain | capital-loss | hours-per-week | |
---|---|---|---|---|

count | 3.663100e+04 | 3.663100e+04 | 3.663100e+04 | 3.663100e+04 |

mean | -2.273364e-16 | 3.530310e-17 | 3.840667e-17 | 1.844684e-16 |

std | 1.000014e+00 | 1.000014e+00 | 1.000014e+00 | 1.000014e+00 |

min | -1.576792e+00 | -1.445084e-01 | -2.202513e-01 | -3.173852e+00 |

25% | -7.753674e-01 | -1.445084e-01 | -2.202513e-01 | -3.471139e-02 |

50% | -1.196565e-01 | -1.445084e-01 | -2.202513e-01 | -3.471139e-02 |

75% | 6.817680e-01 | -1.445084e-01 | -2.202513e-01 | 3.677425e-01 |

max | 3.741752e+00 | 1.314865e+01 | 1.047970e+01 | 4.714245e+00 |

Notice that the mean of all the columns is close to 0 and the standard deviation
in all cases is close to 1.
We can also visualize the effect of `StandardScaler`

using a jointplot to show
both the histograms of the distributions and a scatterplot of any pair of numerical
features at the same time. We can observe that `StandardScaler`

does not change
the structure of the data itself but the axes get shifted and scaled.

```
import matplotlib.pyplot as plt
import seaborn as sns
# number of points to visualize to have a clearer plot
num_points_to_plot = 300
sns.jointplot(data=data_train[:num_points_to_plot], x="age",
y="hours-per-week", marginal_kws=dict(bins=15))
plt.suptitle("Jointplot of 'age' vs 'hours-per-week' \nbefore StandardScaler", y=1.1)
sns.jointplot(data=data_train_scaled[:num_points_to_plot], x="age",
y="hours-per-week", marginal_kws=dict(bins=15))
_ = plt.suptitle("Jointplot of 'age' vs 'hours-per-week' \nafter StandardScaler", y=1.1)
```

We can easily combine sequential operations with a scikit-learn
`Pipeline`

, which chains together operations and is used as any other
classifier or regressor. The helper function `make_pipeline`

will create a
`Pipeline`

: it takes as arguments the successive transformations to perform,
followed by the classifier or regressor model.

```
import time
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model
```

Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])

StandardScaler()

LogisticRegression()

The `make_pipeline`

function did not require us to give a name to each step.
Indeed, it was automatically assigned based on the name of the classes
provided; a `StandardScaler`

will be a step named `"standardscaler"`

in the
resulting pipeline. We can check the name of each steps of our model:

```
model.named_steps
```

```
{'standardscaler': StandardScaler(),
'logisticregression': LogisticRegression()}
```

This predictive pipeline exposes the same methods as the final predictor:
`fit`

and `predict`

(and additionally `predict_proba`

, `decision_function`

,
or `score`

).

```
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start
```

We can represent the internal mechanism of a pipeline when calling `fit`

by the following diagram:

When calling `model.fit`

, the method `fit_transform`

from each underlying
transformer (here a single transformer) in the pipeline will be called to:

learn their internal model states

transform the training data. Finally, the preprocessed data are provided to train the predictor.

To predict the targets given a test set, one uses the `predict`

method.

```
predicted_target = model.predict(data_test)
predicted_target[:5]
```

```
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
```

Letβs show the underlying mechanism:

The method `transform`

of each transformer (here a single transformer) is
called to preprocess the data. Note that there is no need to call the `fit`

method for these transformers because we are using the internal model states
computed when calling `model.fit`

. The preprocessed data is then provided to
the predictor that will output the predicted target by calling its method
`predict`

.

As a shorthand, we can check the score of the full predictive pipeline
calling the method `model.score`

. Thus, letβs check the computational and
generalization performance of such a predictive pipeline.

```
model_name = model.__class__.__name__
score = model.score(data_test, target_test)
print(f"The accuracy using a {model_name} is {score:.3f} "
f"with a fitting time of {elapsed_time:.3f} seconds "
f"in {model[-1].n_iter_[0]} iterations")
```

```
The accuracy using a Pipeline is 0.807 with a fitting time of 0.094 seconds in 12 iterations
```

We could compare this predictive model with the predictive model used in the previous notebook which did not scale features.

```
model = LogisticRegression()
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start
```

```
model_name = model.__class__.__name__
score = model.score(data_test, target_test)
print(f"The accuracy using a {model_name} is {score:.3f} "
f"with a fitting time of {elapsed_time:.3f} seconds "
f"in {model.n_iter_[0]} iterations")
```

```
The accuracy using a LogisticRegression is 0.807 with a fitting time of 0.211 seconds in 59 iterations
```

We see that scaling the data before training the logistic regression was beneficial in terms of computational performance. Indeed, the number of iterations decreased as well as the training time. The generalization performance did not change since both models converged.

Warning

Working with non-scaled data will potentially force the algorithm to iterate
more as we showed in the example above. There is also the catastrophic
scenario where the number of required iterations are more than the maximum
number of iterations allowed by the predictor (controlled by the `max_iter`

)
parameter. Therefore, before increasing `max_iter`

, make sure that the data
are well scaled.

## Model evaluation using cross-validationΒΆ

In the previous example, we split the original data into a training set and a testing set. The score of a model will in general depend on the way we make such a split. One downside of doing a single split is that it does not give any information about this variability. Another downside, in a setting where the amount of data is small, is that the the data available for training and testing will be even smaller after splitting.

Instead, we can use cross-validation. Cross-validation consists of repeating the procedure such that the training and testing sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result we can get an estimate of the variability of the modelβs generalization performance.

Note that there exists several cross-validation strategies, each of them
defines how to repeat the `fit`

/`score`

procedure. In this section, we will
use the K-fold strategy: the entire dataset is split into `K`

partitions. The
`fit`

/`score`

procedure is repeated `K`

times where at each iteration `K - 1`

partitions are used to fit the model and `1`

partition is used to score. The
figure below illustrates this K-fold strategy.

Note

This figure shows the particular case of K-fold cross-validation strategy. As mentioned earlier, there are a variety of different cross-validation strategies. Some of these aspects will be covered in more details in future notebooks.

For each cross-validation split, the procedure trains a model on all the red samples and evaluate the score of the model on the blue samples. Cross-validation is therefore computationally intensive because it requires training several models instead of one.

In scikit-learn, the function `cross_validate`

allows to do cross-validation
and you need to pass it the model, the data, and the target. Since there
exists several cross-validation strategies, `cross_validate`

takes a
parameter `cv`

which defines the splitting strategy.

```
%%time
from sklearn.model_selection import cross_validate
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result
```

```
CPU times: user 1.01 s, sys: 300 ms, total: 1.31 s
Wall time: 686 ms
```

```
{'fit_time': array([0.09947348, 0.10267353, 0.10435319, 0.09859276, 0.09749627]),
'score_time': array([0.02080703, 0.02168918, 0.02666163, 0.02029109, 0.01948094]),
'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80436118])}
```

The output of `cross_validate`

is a Python dictionary, which by default
contains three entries: (i) the time to train the model on the training data
for each fold, (ii) the time to predict with the model on the testing data
for each fold, and (iii) the default score on the testing data for each fold.

Setting `cv=5`

created 5 distinct splits to get 5 variations for the training
and testing sets. Each training set is used to fit one model which is then
scored on the matching test set. This strategy is called K-fold
cross-validation where `K`

corresponds to the number of splits.

Note that by default the `cross_validate`

function discards the 5 models that
were trained on the different overlapping subset of the dataset. The goal of
cross-validation is not to train a model, but rather to estimate
approximately the generalization performance of a model that would have been
trained to the full training set, along with an estimate of the variability
(uncertainty on the generalization accuracy).

You can pass additional parameters to `cross_validate`

to get more
information, for instance training scores. These features will be covered in
a future notebook.

Letβs extract the test scores from the `cv_result`

dictionary and compute
the mean accuracy and the variation of the accuracy across folds.

```
scores = cv_result["test_score"]
print("The mean cross-validation accuracy is: "
f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

```
The mean cross-validation accuracy is: 0.800 +/- 0.003
```

Note that by computing the standard-deviation of the cross-validation scores, we can estimate the uncertainty of our model generalization performance. This is the main advantage of cross-validation and can be crucial in practice, for example when comparing different models to figure out whether one is better than the other or whether the generalization performance differences are within the uncertainty.

In this particular case, only the first 2 decimals seem to be trustworthy. If you go up in this notebook, you can check that the performance we get with cross-validation is compatible with the one from a single train-test split.

In this notebook we have:

seen the importance of

**scaling numerical variables**;used a

**pipeline**to chain scaling and logistic regression training;assessed the generalization performance of our model via

**cross-validation**.