Exercise M4.02
In a previous notebook we introduced the use of performance metrics to evaluate a clustering model when we have access to labeled data, namely the V-measure and Adjusted Rand Index (ARI). In this exercise you will get familiar with another supervised metric for clustering, known as Adjusted Mutual Information (AMI).
To illustrate the different concepts, we retain some of the features from the penguins dataset.
import pandas as pd
columns_to_keep = [
    "Culmen Length (mm)",
    "Culmen Depth (mm)",
    "Flipper Length (mm)",
    "Body Mass (g)",
    "Sex",
    "Species",
]
penguins = pd.read_csv("../datasets/penguins.csv")[columns_to_keep].dropna()
species = penguins["Species"].str.split(" ").str[0]
penguins = penguins.drop(columns=["Species"])
penguins
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
| ... | ... | ... | ... | ... | ... |
| 339 | 55.8 | 19.8 | 207.0 | 4000.0 | MALE |
| 340 | 43.5 | 18.1 | 202.0 | 3400.0 | FEMALE |
| 341 | 49.6 | 18.2 | 193.0 | 3775.0 | MALE |
| 342 | 50.8 | 19.0 | 210.0 | 4100.0 | MALE |
| 343 | 50.2 | 18.7 | 198.0 | 3775.0 | FEMALE |
334 rows × 5 columns
We recall that the silhouette score reached a maximum at n_clusters=6
when using all of the features above (excluding the species). Our hypothesis
was that those clusters correspond to the 3 species of penguins present in the
dataset (Adelie, Chinstrap, and Gentoo), each further split by sex (2 clusters
for each species).
Repeat the same pipeline, consisting of a OneHotEncoder with
drop="if_binary" for the "Sex" column and a StandardScaler for the other
columns. The final estimator should be KMeans with n_clusters=6. You can
set the random_state for reproducibility, but that should not change the
interpretation of the results.
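One possible shape for this pipeline is sketched below. To stay self-contained it runs on a small synthetic stand-in for the penguins table: the toy dataframe, its two numerical columns and its random values are illustrative assumptions, not the real data. On the actual dataset you would pass penguins and list all four numerical columns.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data (NOT the real penguins measurements).
rng = np.random.default_rng(0)
n_rows = 60
toy = pd.DataFrame(
    {
        "Culmen Length (mm)": rng.normal(44, 5, n_rows),
        "Flipper Length (mm)": rng.normal(200, 14, n_rows),
        "Sex": rng.choice(["MALE", "FEMALE"], n_rows),
    }
)

numerical_columns = ["Culmen Length (mm)", "Flipper Length (mm)"]
preprocessor = make_column_transformer(
    # A binary category is encoded as a single 0/1 column.
    (OneHotEncoder(drop="if_binary"), ["Sex"]),
    # Numerical features are centered and scaled to unit variance.
    (StandardScaler(), numerical_columns),
)
model = make_pipeline(preprocessor, KMeans(n_clusters=6, random_state=0))

# fit_predict returns one cluster id (0 to 5) per row.
cluster_labels = model.fit_predict(toy)
print(np.bincount(cluster_labels))
```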
# Write your code here.
Make two sns.scatterplot of "Culmen Length (mm)" versus "Flipper Length
(mm)", side-by-side. On one of them, the hue should be the "species and sex"
coming from the known information in the dataset, and the hue in the other
should be the cluster labels.
Only the colors may differ, as the ordering of the labels is arbitrary (both for the k-means cluster and the "true" labels).
# Write your code here.
We now have a visual intuition of the agreement between the clusters found by k-means and the combination of the "Species" and "Sex" labels. We can further quantify it using the Adjusted Mutual Information (AMI) score.
Use
sklearn.metrics.adjusted_mutual_info_score
to compare both sets of labels. The AMI returns a value of 1 when the two
partitions are identical (i.e. perfectly matched).
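Before applying it to the penguins, the behavior of the score can be checked on a tiny hand-made example. The label lists below are made up for illustration. The first pair describes the same grouping under different names, while the second pair disagrees on one sample.

```python
from sklearn.metrics import adjusted_mutual_info_score

labels_a = [0, 0, 1, 1, 2, 2]
# Same partition as labels_a, only the group names differ.
labels_b = ["x", "x", "y", "y", "z", "z"]
# The last sample is moved to another group.
labels_c = [0, 0, 1, 1, 2, 1]

ami_same = adjusted_mutual_info_score(labels_a, labels_b)
ami_partial = adjusted_mutual_info_score(labels_a, labels_c)
print(f"identical partitions: {ami_same:.3f}")
print(f"one sample moved:     {ami_partial:.3f}")
```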
# Write your code here.
Now use a
sklearn.preprocessing.LabelEncoder
to fit_transform the "true" labels (coming from combinations of species and
sex). What would be the accuracy if we tried to use it to measure the
agreement between both sets of labels?
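As a small illustration of what to expect, the sketch below encodes a made-up list of string labels and compares it with hypothetical cluster ids that form exactly the same groups. LabelEncoder assigns integers following the sorted order of the class names, which has no reason to match the arbitrary ids chosen by a clustering algorithm.

```python
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Made-up "true" labels and hypothetical cluster ids with identical grouping.
true_names = [
    "Adelie MALE", "Adelie MALE",
    "Adelie FEMALE", "Adelie FEMALE",
    "Gentoo MALE", "Gentoo MALE",
]
hypothetical_clusters = [0, 0, 1, 1, 2, 2]

encoder = LabelEncoder()
encoded_true = encoder.fit_transform(true_names)

# Integer codes follow the sorted class names, not the order of appearance.
print(encoder.classes_.tolist())
print(encoded_true.tolist())

# Accuracy compares the integers position by position.
accuracy = accuracy_score(encoded_true, hypothetical_clusters)
print(f"accuracy: {accuracy:.3f}")
```

Here the two groupings agree perfectly, yet accuracy only credits the samples whose integer codes happen to coincide.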
# Write your code here.
Permute the cluster labels using np.random.permutation, then compute both
the AMI and the accuracy when comparing the true and permuted labels. Are they
sensitive to relabeling?
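The sketch below reads "permute" as renaming the cluster ids (an assumption about the intended meaning): np.random.permutation draws a new name for each id, and indexing with the original labels applies that renaming. The arrays are toy values, not the penguins results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_mutual_info_score

true_ids = np.array([0, 0, 1, 1, 2, 2, 2])
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 1])

np.random.seed(0)  # only for reproducibility of the illustration
new_names = np.random.permutation(3)  # new_names[i] is the new name of id i
relabeled_ids = new_names[cluster_ids]

ami_before = adjusted_mutual_info_score(true_ids, cluster_ids)
ami_after = adjusted_mutual_info_score(true_ids, relabeled_ids)
accuracy_before = accuracy_score(true_ids, cluster_ids)
accuracy_after = accuracy_score(true_ids, relabeled_ids)

print(f"AMI:      {ami_before:.3f} -> {ami_after:.3f}")
print(f"accuracy: {accuracy_before:.3f} -> {accuracy_after:.3f}")
```

The grouping of samples is untouched by the renaming, so any metric that depends only on the partition should report the same value before and after.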
# Write your code here.
AMI is designed to return a value near zero (it can be negative) when the clustering is no better than random.
To understand how AMI corrects for chance, compare the true labels with a
completely random labeling using np.random.randint to generate as many
labels as rows in the dataset, each containing a value between 0 and 5 (to
match the number of clusters).
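A minimal sketch of this experiment on synthetic labels is given below. The balanced true_ids array and the sample size are assumptions made for the illustration. Note that the upper bound of randint is exclusive, so high=6 yields values from 0 to 5.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.RandomState(0)
n_samples = 3000

# Synthetic "true" labels: 6 balanced classes (illustrative assumption).
true_ids = np.repeat(np.arange(6), n_samples // 6)
# Labels drawn uniformly at random, independent of true_ids.
random_ids = rng.randint(0, 6, size=n_samples)

ami_random = adjusted_mutual_info_score(true_ids, random_ids)
print(f"AMI against random labels: {ami_random:.4f}")
```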
# Write your code here.
We can conclude by comparing AMI to other metrics:
Adjusted Rand Index (ARI): Also corrects for chance, but it counts pairs of points: how many pairs that are together in the true labels are also together in the clusters. It is combinatorial rather than information-theoretic like AMI.
V-measure: Based on homogeneity (do clusters contain mostly one class?) and completeness (are all members of a class grouped together?), but it does not correct for chance. If you run a random clustering, V-measure might still give a misleadingly non-zero score, unlike AMI or ARI.
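The difference in chance correction can be checked empirically. The sketch below, run on synthetic labels with a deliberately small sample size (both are assumptions for the illustration), averages the three metrics over repeated random labelings.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    v_measure_score,
)

rng = np.random.RandomState(0)
# 60 samples in 6 balanced classes (illustrative assumption).
true_ids = np.repeat(np.arange(6), 10)

metrics = {
    "V-measure": v_measure_score,
    "ARI": adjusted_rand_score,
    "AMI": adjusted_mutual_info_score,
}
scores = {name: [] for name in metrics}
for _ in range(50):
    # A fresh labeling that ignores the true classes entirely.
    random_ids = rng.randint(0, 6, size=true_ids.size)
    for name, metric in metrics.items():
        scores[name].append(metric(true_ids, random_ids))

mean_scores = {name: float(np.mean(values)) for name, values in scores.items()}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

With few samples per cluster, the unadjusted V-measure is expected to stay clearly above zero on average, while the two adjusted scores should average close to zero.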