Wrap-up quiz 4
This quiz requires some programming to be answered.
Load the periodic_signals.csv dataset with the following cell of code. It
contains readings from 170 industrial sensors installed throughout a
manufacturing facility. Each sensor records the average power consumption (in
watts) of a specific machine, with one measurement taken every minute.
Different machines operate with their own characteristic cycles. Rare
events, such as machinery faults or unexpected disturbances, appear as signals
with abnormal frequency patterns. The goal is to identify those disturbances
using the tools we have learned during this module.
import pandas as pd

periodic_signals = pd.read_csv("../datasets/periodic_signals.csv")
_ = periodic_signals.iloc[0].plot(
    xlabel="time (minutes)",
    ylabel="power (Watts)",
    title="Signal from the first sensor",
)
Let's see if we can find one or more stable candidates for the number of
clusters (n_clusters) using the silhouette score when resampling the
dataset. To do so:
- Create a pipeline consisting of a RobustScaler (as it is a good scaling
  option when dealing with outliers), followed by KMeans with n_init=5. You
  can choose to set random_state=0 in the KMeans step, but fixing it or not
  should not change the conclusions.
- Generate randomly resampled data consisting of 90% of the data by using
  train_test_split with train_size=0.9. Change the random_state in
  train_test_split to try around 20 different resamplings. You can use the
  plot_n_clusters_scores function (or a simplified version of it) inside a
  for loop as we did in a previous exercise.
- In each resampling, compute the silhouette score for n_clusters varying in
  range(2, 11).
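The steps above can be sketched as follows. This is a minimal illustration, not the reference solution: the helper name silhouette_by_resampling is hypothetical, the data comes from make_blobs as a synthetic stand-in for the quiz dataset, and computing the silhouette score in the scaled feature space is an assumption about how the score should be evaluated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def silhouette_by_resampling(data, n_clusters_values=range(2, 11), n_resamplings=20):
    """Return scores of shape (n_resamplings, len(n_clusters_values))."""
    scores = np.empty((n_resamplings, len(n_clusters_values)))
    for seed in range(n_resamplings):
        # Keep a random 90% of the rows; the seed selects the resampling.
        resampled, _ = train_test_split(data, train_size=0.9, random_state=seed)
        for col, n_clusters in enumerate(n_clusters_values):
            model = make_pipeline(
                RobustScaler(),
                KMeans(n_clusters=n_clusters, n_init=5, random_state=0),
            )
            labels = model.fit_predict(resampled)
            # Assumption: score in the scaled space where k-means operated.
            scaled = model[:-1].transform(resampled)
            scores[seed, col] = silhouette_score(scaled, labels)
    return scores


# Synthetic stand-in data (NOT the quiz dataset), used only to run the helper.
X_demo, _ = make_blobs(n_samples=170, n_features=20, centers=4, random_state=0)
demo_scores = silhouette_by_resampling(X_demo, n_resamplings=3)
print(demo_scores.mean(axis=0).round(2))  # mean score per n_clusters value
```

With the real data, each row of the returned array can be drawn as one curve against n_clusters, so that stable maxima show up as peaks shared by most resamplings.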
Question
Using the silhouette score heuristics, select the correct statements:
a) 3 or 4 clusters maximize the score and are reasonably stable choices.
b) 5 or 6 clusters maximize the score and are reasonably stable choices.
c) 7 or 8 clusters maximize the score and are reasonably stable choices.
d) Scores in this range of n_clusters are always negative, denoting a bad
clustering model.
e) Scores in this range of n_clusters are always positive, but hint at weak
to moderate cluster cohesion/separation.
Select all answers that apply
Question
Set n_clusters=8 in the KMeans step of your previous pipeline for the rest
of this quiz. We are going to define an outlier_score as the distance from
each signal to its closest centroid (computed from the output of the
fit_transform method of the pipeline).
What are the indices of the 5 signals that are the farthest from any centroid?
a) [ 77 32 112 105 101]
b) [ 92 49 101 132 146]
c) [ 80 49 121 150 101]
d) [ 64 98 118 163 121]
Select a single answer
Hint: You can make use of numpy.min and numpy.argsort. Also, remember that
the output of fit_transform is a numpy array of shape (n_samples, n_clusters).
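The hint can be illustrated with the short sketch below. The helper name farthest_from_centroids and the synthetic make_blobs data are assumptions made for illustration only, so the printed indices say nothing about the quiz dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def farthest_from_centroids(data, n_clusters=8, n_top=5):
    """Return the per-sample outlier score and the n_top most isolated indices."""
    model = make_pipeline(
        RobustScaler(),
        KMeans(n_clusters=n_clusters, n_init=5, random_state=0),
    )
    # fit_transform gives distances to every centroid: (n_samples, n_clusters).
    distances = model.fit_transform(data)
    # The outlier score is the distance to the closest centroid.
    outlier_score = np.min(distances, axis=1)
    # argsort is ascending, so the last n_top entries have the largest scores.
    top_indices = np.argsort(outlier_score)[-n_top:]
    return outlier_score, top_indices


# Synthetic stand-in data (NOT the quiz dataset).
X_demo, _ = make_blobs(n_samples=170, n_features=20, centers=4, random_state=0)
demo_score, demo_top = farthest_from_centroids(X_demo, n_clusters=4)
print(demo_top)
```

Because numpy.argsort sorts in ascending order, the last index returned is the most isolated sample.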
Question
Create an HDBSCAN model (no need for scaling) with min_cluster_size=10.
How many clusters (excluding the noise label, which is not a cluster) are
found by this model?
a) 5
b) 6
c) 7
d) 8
Select a single answer
Question
How many signals are identified as noise by this HDBSCAN model?
a) 3
b) 5
c) 7
d) 9
Select a single answer
A priori we don't know if the signals are isotropic or follow a Gaussian
distribution in the feature space (i.e. if they form spherical blobs). Because
of that, we don't know whether centroid-based or density-based clustering is
more suitable. We would like to compare the results from both models, but we
know that the presence of outliers makes the silhouette score tricky to
interpret. We can still use other metrics, such as Adjusted Mutual Information
(AMI), to compare both models.
But first we need k-means to behave similarly to HDBSCAN. To that end, we
can identify the points that are too far from any centroid as outliers using
the outlier_score as defined before. Instead of setting a fixed distance
threshold, we can flag the n_outliers signals with the highest outlier scores
as -1.
To do so:
- Cluster your signals with KMeans (using fit_predict) to get kmeans_labels.
- For a range of values of n_outliers, re-label the n_outliers signals with
  the highest outlier_score to -1.
- Compute and plot the AMI between this modified KMeans labeling and the
  HDBSCAN cluster labels as a function of n_outliers.
Question
If we denote by n_noise the number of signals identified as noise by
HDBSCAN, select the true statements:
a) AMI reaches a maximum when n_outliers < n_noise: some points marked as
noise by HDBSCAN are not clearly isolated from a centroid.
b) AMI reaches a maximum when n_outliers = n_noise: the two models most
strongly agree.
c) AMI reaches a maximum when n_outliers > n_noise: k-means has created small
clusters (with fewer than min_cluster_size samples) that match what HDBSCAN
considers noise.
d) AMI is too close to zero, indicating coincidences between models are mostly
random.
Select a single answer