A46: Evaluating Clustering Algorithms — Unsupervised Machine Learning

K-means, K-medoids, Agglomerative, Silhouette Coefficient, Silhouette Score, Unsupervised Learning, Clustering

Junaid Qazi, PhD
5 min readMay 25, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series (Part-3). (Click here to get your copy of Part-1 & Part-2 today.)

Click here for the previous article/lecture on “A45: Clustering — Unsupervised Machine Learning”

💐Click here to FOLLOW ME for new contents💐

⚠️ We will be working with the seed dataset for learning purposes in this lecture.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful!”

🧘🏻‍♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻‍♂️ 👉🎯

In the previous lecture, we trained a range of clustering algorithms on the seed data. Let’s move on and evaluate those trained models to see which one is the better choice for the selected data.

Silhouette coefficient

Clustering is an unsupervised machine learning technique, and there is no way to evaluate a model using the accuracy score or the other statistics we rely on in supervised machine learning. However, silhouette analysis provides interpretation and validation of the consistency within clusters of the data.

The silhouette coefficient is a measure of how well each sample is grouped with other samples that are similar to it. In other words, the silhouette coefficient or silhouette score is a metric that quantifies the goodness of a clustering, and its value ranges from -1 (worst) to 1 (best). Values near 0 indicate overlapping clusters.

Clustering models with a high Silhouette Coefficient are said to be dense, where samples in the same cluster are similar to each other, and well separated, where samples in different clusters are not very similar to each other.
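For reference, the silhouette coefficient of a single sample i is computed from a(i), the mean distance to the other points in its own cluster, and b(i), the mean distance to the points in the nearest neighbouring cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```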

Moving forward, let’s compute the silhouette coefficient (the mean silhouette score) for our trained clustering algorithms on the seed data. We have five clustering models, and the question is which one is better. Clearly, DBSCAN and OPTICS are not the right choice here.

Let’s see which one is better among k-means, k-medoids, and agglomerative clustering, based on their silhouette coefficients.
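A minimal sketch of how these scores can be computed with scikit-learn’s silhouette_score. The variable X (the scaled seed feature matrix), the choice of three clusters, and the model settings are assumptions carried over from the previous lecture; KMedoids comes from the scikit-learn-extra package.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn_extra.cluster import KMedoids   # provided by the scikit-learn-extra package
from sklearn.metrics import silhouette_score

# X is assumed to be the scaled seed feature matrix from the previous lecture.
models = {
    "kmeans": KMeans(n_clusters=3, random_state=42),
    "kmedoids": KMedoids(n_clusters=3, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Cluster assignments and the mean silhouette coefficient for each model
labels = {name: model.fit_predict(X) for name, model in models.items()}
for name, y in labels.items():
    print(f"{name}: silhouette = {silhouette_score(X, y):.3f}")
```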

The mean silhouette coefficient values for the three selected clustering algorithms.

From the above scores, it looks like k-means is the best choice for the selected seed data.

Well, instead of mean Silhouette coefficient, let’s look at the individual Silhouette coefficients for each sample in the data for their assigned cluster.

Silhouette Coefficient for each sample

The silhouette coefficient can be calculated with any distance metric, such as the Euclidean or the Manhattan distance. Let’s move on and compute the silhouette coefficient for each sample and plot them to see how each sample contributes towards the mean value for our trained clustering algorithms.

  • silhouette_samples from scikit-learn can be used for this purpose; it uses Euclidean distance by default (see the sketch below).
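A minimal sketch, reusing X and the labels dictionary assumed in the sketch above:

```python
from sklearn.metrics import silhouette_samples

# Per-sample silhouette coefficients for each model (Euclidean distance by default;
# pass metric="manhattan" or another metric to change it).
sample_silhouettes = {name: silhouette_samples(X, y) for name, y in labels.items()}
```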

So, we are all set and have the silhouette coefficients for each sample for the three algorithms. Let’s move on and write a utility function to plot these silhouette values along with the clustered data points for the selected algorithms.
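The original notebook contains the author’s own plotting routine; the sketch below is a hypothetical, condensed stand-in modelled on the silhouette-plot example in the scikit-learn documentation. It assumes X is a NumPy array of the seven scaled seed features, with area in column 0 and kernel groove length in column 6.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_silhouette(X, cluster_labels, sample_silhouette_values, title=""):
    """Silhouette profile (left) and a representative 2-D scatter (right)."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    y_lower = 10
    for i in np.unique(cluster_labels):
        # Sorted silhouette values of the samples assigned to cluster i
        vals = np.sort(sample_silhouette_values[cluster_labels == i])
        y_upper = y_lower + vals.shape[0]
        ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, vals, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * vals.shape[0], str(i))
        y_lower = y_upper + 10  # gap between the cluster profiles

    # Dashed red line marks the mean silhouette coefficient
    ax1.axvline(x=sample_silhouette_values.mean(), color="red", linestyle="--")
    ax1.set_xlabel("Silhouette coefficient")
    ax1.set_ylabel("Cluster label")

    # Representative scatter of two of the seven features (assumed column order)
    ax2.scatter(X[:, 0], X[:, 6], c=cluster_labels, cmap="viridis", s=30)
    ax2.set_xlabel("Area")
    ax2.set_ylabel("Kernel groove length")

    fig.suptitle(title)
    plt.tight_layout()
    plt.show()
```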

IMPORTANT note for the plots below >> Please note that the Area vs. Kernel groove length scatter plot is just a representation of the data; the model is trained on seven features, including these two. Overlap in the scatter plot does not mean the data points are not clearly separated. For a true two-dimensional representation, we would need to train the clustering models on the two selected features only.

K-means

Let’s get the plots using our recently written function. We need to pass the computed silhouette coefficients for the individual samples and the respective trained clustering algorithm from the previous lecture.
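For example, with the hypothetical helper and the dictionaries sketched above (the same call with "kmedoids" or "agglomerative" produces the plots discussed in the next sections):

```python
plot_silhouette(X, labels["kmeans"], sample_silhouettes["kmeans"], title="K-means")
```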

On the left, no data point has a negative silhouette coefficient. For the scatter plot on the right, please read the note above. Remember, the model is trained on seven features, including the two in the scatter plot; the overlapping points appear only because of the 2-D projection.

K-medoids

Negative values of the silhouette coefficient in clusters 0 and 1 (left plot) suggest that those points are wrongly assigned by the k-medoids algorithm.

Agglomerative Clustering

Negative values of the silhouette coefficient in all three clusters suggest that some points are wrongly assigned to their respective clusters.

Keeping these points in mind:

  • 1 suggests that the clusters are well apart from each other and clearly distinguished
  • 0 suggests that the clusters are indifferent, or we can say that the distance between clusters is not significant
  • -1 suggests that the clusters are assigned in the wrong way

>> What do you conclude?
>> Which algorithms will be your choice?
>> Do you want to get the silhouette plots for DBSCAN and OPTICS? If so, what do you conclude from those plots?

Good to know:

Silhouette information evaluates the quality of the partition detected by a clustering technique. Since it is based on a measure of distance between the clustered observations, its standard formulation is not adequate when a density-based clustering technique is used.

The study “Density-based Silhouette diagnostics for clustering methods” proposes a suitable modification of the silhouette information aimed at evaluating the quality of clusters in a density-based framework, based on estimating the posterior probabilities of the data belonging to the clusters.

So, this was all for this article.

Good luck and Keep practicing!

*******************************************************************

💐Click here to FOLLOW ME for new contents💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A47:…………….”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************

