Making clusters part of a supervised pipeline#
This notebook explores how K-means clustering can be used as a feature engineering step to improve the performance of a regression model.
Here we use the California Housing dataset, which includes information about the geographic location (latitude and longitude).
Our goal is to predict the median house value (MedHouseVal) using a ridge regression model, to investigate whether adding features derived from applying K-means to geographic coordinates can improve the pipelineβs predictive performance.
from sklearn.datasets import fetch_california_housing
data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100 # rescale the target in k$
We can first design a predictive pipeline that completly ignores the coordinates:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.compose import make_column_transformer
data_train, data_test, target_train, target_test = train_test_split(
data, target, test_size=0.2, random_state=0
)
geo_columns = ["Latitude", "Longitude"]
model_drop_geo = make_pipeline(
make_column_transformer(("drop", geo_columns), remainder="passthrough"),
StandardScaler(),
Ridge(alpha=1e-12),
)
test_error_drop_geo = -cross_val_score(
model_drop_geo, data_train, target_train, scoring="neg_mean_absolute_error"
)
print(
"The test MAE without geographical features is: "
f"{test_error_drop_geo.mean():.2f} Β± {test_error_drop_geo.std():.2f} k$"
)
The test MAE without geographical features is: 57.35 Β± 0.53 k$
We observe a Mean Absolute Error of approximately 57k$ when dropping the geographical features.
As seen in the previous notebook, we suspect that the price information may be linked to the distance to the nearest urban center, and proximity to the coast:
import plotly.express as px
fig = px.scatter_map(
data,
lat="Latitude",
lon="Longitude",
color=target,
zoom=5,
height=600,
labels={"color": "price (k$)"},
)
fig.update_layout(
mapbox_style="open-street-map",
mapbox_center={
"lat": data["Latitude"].mean(),
"lon": data["Longitude"].mean(),
},
margin={"r": 0, "t": 0, "l": 0, "b": 0},
)
fig.show(renderer="notebook")
We first feed the coordinates directly to the linear model without transformation.
model_naive_geo = make_pipeline(StandardScaler(), Ridge(alpha=1e-12))
test_error_naive_geo = -cross_val_score(
model_naive_geo,
data_train,
target_train,
scoring="neg_mean_absolute_error",
)
print(
"The test MAE with raw geographical features is: "
f"{test_error_naive_geo.mean():.2f} Β± {test_error_naive_geo.std():.2f} k$"
)
The test MAE with raw geographical features is: 53.13 Β± 0.76 k$
Including the geospatial data naively improves the performance a bit, however, we suspect that we can do better by introducing features that represent proximity to points of interests (urban centers, the coast, parks, etc.).
We could look for a dataset containing all the coordinates of the city centers, the coast line and other points of interest in California, then manually engineer such features. However this would require a non-trivial amount of code. Instead we can rely on K-means to achieve something similar implicitly: we use it to find a large number of centroids from the housing data directly and consider each centroid a potential point of interest.
The KMeans class implements a transform method that, given a set of data
points as an argument, computes the distance to the nearest centroid for each
of them. As a result, KMeans can be used as a preprocessing step in a
feature engineering pipeline as follows:
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
model_cluster_geo = make_pipeline(
make_column_transformer(
(KMeans(n_clusters=100), geo_columns),
remainder="passthrough",
force_int_remainder_cols=False, # Silence a warning in scikit-learn v1.6
),
StandardScaler(),
Ridge(alpha=1e-12),
)
test_error_cluster_geo = -cross_val_score(
model_cluster_geo,
data_train,
target_train,
scoring="neg_mean_absolute_error",
)
print(
"The test MAE with clustered geographical features is:"
f" {test_error_cluster_geo.mean():.2f} Β±"
f" {test_error_cluster_geo.std():.2f} k$"
)
The test MAE with clustered geographical features is: 42.16 Β± 0.95 k$
The resulting mean test error is much lower. Furthermore, KMeans is now part
of a supervised pipeline, which means we can use a grid-search to tune
n_clusters as we have learned in previous modules of this course.
from sklearn.model_selection import GridSearchCV
param_name = "columntransformer__kmeans__n_clusters"
param_grid = {param_name: [10, 30, 100, 300, 1_000, 3_000]}
grid_search = GridSearchCV(
model_cluster_geo, param_grid=param_grid, scoring="neg_mean_absolute_error"
)
grid_search.fit(data_train, target_train)
GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(force_int_remainder_cols=False,
remainder='passthrough',
transformers=[('kmeans',
KMeans(n_clusters=100),
['Latitude',
'Longitude'])])),
('standardscaler', StandardScaler()),
('ridge', Ridge(alpha=1e-12))]),
param_grid={'columntransformer__kmeans__n_clusters': [10, 30, 100,
300, 1000,
3000]},
scoring='neg_mean_absolute_error')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(force_int_remainder_cols=False,
remainder='passthrough',
transformers=[('kmeans',
KMeans(n_clusters=100),
['Latitude',
'Longitude'])])),
('standardscaler', StandardScaler()),
('ridge', Ridge(alpha=1e-12))]),
param_grid={'columntransformer__kmeans__n_clusters': [10, 30, 100,
300, 1000,
3000]},
scoring='neg_mean_absolute_error')Pipeline(steps=[('columntransformer',
ColumnTransformer(force_int_remainder_cols=False,
remainder='passthrough',
transformers=[('kmeans',
KMeans(n_clusters=3000),
['Latitude', 'Longitude'])])),
('standardscaler', StandardScaler()),
('ridge', Ridge(alpha=1e-12))])ColumnTransformer(force_int_remainder_cols=False, remainder='passthrough',
transformers=[('kmeans', KMeans(n_clusters=3000),
['Latitude', 'Longitude'])])['Latitude', 'Longitude']
KMeans(n_clusters=3000)
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
passthrough
StandardScaler()
Ridge(alpha=1e-12)
results_columns = [
"mean_test_score",
"std_test_score",
"mean_fit_time",
"std_fit_time",
"mean_score_time",
"std_score_time",
"param_" + param_name,
]
grid_search_results = pd.DataFrame(grid_search.cv_results_)[results_columns]
grid_search_results["mean_test_error"] = -grid_search_results[
"mean_test_score"
]
grid_search_results = (
grid_search_results.drop(columns=["mean_test_score"])
.rename(columns={"param_" + param_name: "n_clusters"})
.round(3)
)
grid_search_results.sort_values("mean_test_error", ascending=False)
| std_test_score | mean_fit_time | std_fit_time | mean_score_time | std_score_time | n_clusters | mean_test_error | |
|---|---|---|---|---|---|---|---|
| 0 | 0.659 | 0.018 | 0.001 | 0.004 | 0.000 | 10 | 51.609 |
| 1 | 1.007 | 0.034 | 0.001 | 0.005 | 0.000 | 30 | 47.442 |
| 2 | 1.211 | 0.088 | 0.002 | 0.010 | 0.001 | 100 | 42.542 |
| 3 | 0.899 | 0.291 | 0.008 | 0.019 | 0.001 | 300 | 37.297 |
| 4 | 0.549 | 0.927 | 0.017 | 0.038 | 0.001 | 1000 | 32.836 |
| 5 | 0.635 | 3.758 | 0.021 | 0.095 | 0.006 | 3000 | 29.907 |
Larger number of clusters increases the predictive performance at the cost of larger fitting and prediction times.
labels = {
"mean_fit_time": "CV fit time (s)",
"mean_test_error": "CV score (MAE)",
}
fig = px.scatter(
grid_search_results,
x="mean_fit_time",
y="mean_test_error",
error_x="std_fit_time",
error_y="std_test_score",
hover_data=grid_search_results.columns,
labels=labels,
)
fig.update_layout(
title={
"text": "Trade-off between fit time and mean test score",
"y": 0.95,
"x": 0.5,
"xanchor": "center",
"yanchor": "top",
}
)
fig.show(renderer="notebook")
We can finally evaluate the best model found by our analysis and see how well it can generalize.
best_n_clusters = grid_search.best_params_[param_name]
print(
f"The test MAE with {best_n_clusters} clusters is: "
f"{-grid_search.score(data_test, target_test):.3f} k$"
)
The test MAE with 3000 clusters is: 29.381 k$
This notebook demonstrates one way to leverage clustering for non-linear feature engineering, but there are many ways to compose unsupervised models in a supervised-learning pipeline.
We can finally observe that even if K-means was not the best clustering algorithm from a qualitatively point of view (as presented in the previous notebook), it is still helpful at crafting predictive features.