Visualizing scikit-learn pipelines in Jupyter

The goal of keeping this notebook is to:

make it available for users that want to reproduce it locally
archive the script in case we want to re-record this video after an update to the scikit-learn UI in a future release.

First we load the dataset

We need to define our data and target. In this case, we build a classification model: the target is whether the sale price of a house is above 200_000 dollars.

import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")

target_name = "SalePrice"
data, target = (
    ames_housing.drop(columns=target_name),
    ames_housing[target_name],
)
target = (target > 200_000).astype(int)
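
Because the continuous price is turned into a binary label, it can be useful to check how balanced the two classes are before interpreting an accuracy score. The following is a minimal sketch on a made-up price series; the values and the names `prices`, `binary_target` and `class_balance` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sale prices, for illustration only.
prices = pd.Series(
    [150_000, 250_000, 180_000, 320_000, 199_000], name="SalePrice"
)

# Same thresholding as above: 1 if the price exceeds 200_000, else 0.
binary_target = (prices > 200_000).astype(int)

# Fraction of samples falling in each class.
class_balance = binary_target.value_counts(normalize=True)
print(class_balance)
```

Running the same `value_counts(normalize=True)` call on the real `target` would give the class proportions for the Ames housing data.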

We inspect the dataframe (the display is truncated to its first and last rows):

data

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	2	2008	WD	Normal
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	5	2007	WD	Normal
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	9	2008	WD	Normal
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	2	2006	WD	Abnorml
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	12	2008	WD	Normal
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1455	1456	60	RL	62.0	7917	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	8	2007	WD	Normal
1456	1457	20	RL	85.0	13175	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	MnPrv	NaN	0	2	2010	WD	Normal
1457	1458	70	RL	66.0	9042	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	GdPrv	Shed	2500	5	2010	WD	Normal
1458	1459	20	RL	68.0	9717	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	4	2010	WD	Normal
1459	1460	20	RL	75.0	9937	Pave	NaN	Reg	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	6	2008	WD	Normal

1460 rows × 80 columns

For the sake of simplicity, we can cherry-pick some features and only retain this arbitrary subset of data:

numeric_features = ["LotArea", "FullBath", "HalfBath"]
categorical_features = ["Neighborhood", "HouseStyle"]
data = data[numeric_features + categorical_features]

Then we create the pipeline

The first step is to define the preprocessing steps

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        (
            "scaler",
            StandardScaler(),
        ),
    ]
)

categorical_transformer = OneHotEncoder(handle_unknown="ignore")
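
To get a feel for what these two transformers do, here is a small sketch on toy arrays. The data and the names `X_num`, `X_num_transformed`, `encoder` and `unknown_encoded` are made up for illustration; only the transformer settings mirror the ones defined above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy numeric column with one missing value (illustrative data).
X_num = np.array([[1.0], [2.0], [np.nan], [3.0]])

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
# The missing value is first replaced by the median of the observed values,
# then the column is standardized to zero mean and unit variance.
X_num_transformed = numeric_transformer.fit_transform(X_num)
print(X_num_transformed.ravel())

# Toy categorical column; the category "c" is not seen during fit.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["a"], ["b"]]))

# With handle_unknown="ignore", an unseen category is encoded as an
# all-zero row instead of raising an error.
unknown_encoded = encoder.transform(np.array([["c"]])).toarray()
print(unknown_encoded)
```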

The next step is to apply the transformations using ColumnTransformer

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

Then we define the model and join the steps in order

from sklearn.linear_model import LogisticRegression

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)
model

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['LotArea', 'FullBath',
                                                   'HalfBath']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Neighborhood',
                                                   'HouseStyle'])])),
                ('classifier', LogisticRegression())])

Let’s fit it!

model.fit(data, target)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['LotArea', 'FullBath',
                                                   'HalfBath']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Neighborhood',
                                                   'HouseStyle'])])),
                ('classifier', LogisticRegression())])

Notice that the diagram changes color once the estimator is fit.
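
The diagram is only rendered in environments that display HTML. As a hedged sketch, the display mode can be set explicitly with `set_config`, and the HTML behind the diagram can be obtained as a string with `estimator_html_repr`. The name `demo_model` is a small stand-in pipeline introduced for this example, not the model fitted above.

```python
from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Request the HTML diagram; display="text" switches back to the plain repr.
# In recent scikit-learn releases "diagram" is expected to be the default.
set_config(display="diagram")

# Small stand-in pipeline used only for this illustration.
demo_model = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)

# HTML string of the diagram; it could be written to a file and opened in
# a browser when working outside of a notebook.
html = estimator_html_repr(demo_model)
print(len(html))
```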

So far we used Pipeline and ColumnTransformer, which allow us to customize the names of the steps in the pipeline. An alternative is to use make_column_transformer and make_pipeline. They do not require, and do not permit, naming the estimators. Instead, the names are automatically set to the lowercase of the estimator types.

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler()
)
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(data, target)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['LotArea', 'FullBath',
                                                   'HalfBath']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Neighborhood',
                                                   'HouseStyle'])])),
                ('logisticregression', LogisticRegression())])
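
The automatically generated names matter because they are the keys used to access steps and to set nested parameters. The sketch below uses two small stand-in pipelines (`auto_named` and `duplicated` are names chosen for this example) to show the generated names, including what happens when the same estimator type appears twice.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Step names are the lowercased class names.
auto_named = make_pipeline(StandardScaler(), MinMaxScaler())
step_names = list(auto_named.named_steps)
print(step_names)

# When a type is repeated, a numeric suffix keeps the names unique.
duplicated = make_pipeline(StandardScaler(), StandardScaler())
duplicated_names = list(duplicated.named_steps)
print(duplicated_names)

# The generated name is the prefix for nested parameters.
auto_named.set_params(standardscaler__with_mean=False)
print(auto_named.named_steps["standardscaler"].with_mean)
```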

Finally we can score the model using cross-validation

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.859 ± 0.018
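
For reference, `cross_validate` returns a dictionary of arrays with one entry per fold, and for a classifier the default score is accuracy. The sketch below runs it on synthetic data generated with `make_classification`; the data and the names `demo_model` and `demo_cv` are assumptions for this example, and `return_train_score=True` is added only to show the extra key it produces.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = make_pipeline(StandardScaler(), LogisticRegression())

# One fit per fold; timings and scores are stored per fold.
demo_cv = cross_validate(demo_model, X, y, cv=5, return_train_score=True)
result_keys = sorted(demo_cv)
print(result_keys)
print(len(demo_cv["test_score"]))
```

Comparing `train_score` with `test_score` is one simple way to spot a pipeline that fits the training folds much better than the held-out folds.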

Note

In this case, the pipeline correctly predicts whether the price of a house is above or below the 200_000 dollar threshold around 86% of the time. But be aware that this score was obtained by picking some features by hand, which is not necessarily the best thing we can do for this classification task. In this example, we can hope that fitting a more complex machine learning pipeline on a richer set of features can improve upon this performance level.

Reducing a price estimation problem to a binary classification problem with a single threshold at 200_000 dollars is probably too coarse to be useful in practice. Treating this problem as a regression problem is probably a better idea. We will see later in this MOOC how to train and evaluate the performance of various regression models.