A47: Clustering — A complex multi-cluster dataset

K-means, Agglomerative, Silhouette Coefficient, Silhouette Score, Unsupervised Learning, Density-based clustering, DBSCAN, OPTICS

Junaid Qazi, PhD
6 min readMay 27, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series (Part-3). (Click here to get your copy of Part-1 & Part-2 today.)

Click here for the previous article/lecture on “A46: Evaluating Clustering Algorithm — Unsupervised Machine Learning”.

💐Click here to FOLLOW ME for new content💐

⚠️ In this lecture, we will be working with a complex multi-cluster dataset.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”

🧘🏻‍♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻‍♂️ 👉🎯

In the previous lectures (A45 and A46), we learned about a range of clustering algorithms and their evaluation using the Silhouette Coefficient.

Let’s move on and work with a complex multi-cluster dataset and compare the performance of different clustering algorithms. In this lecture, we will also explore how important the epsilon parameter is in density-based clustering techniques. Remember, not every algorithm is good for every dataset!

Starting with some required imports!

# Required imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.3) # setting font size for the whole notebook
sns.set_style("white") # if you want to set the style
# Setting display format to retina in matplotlib to see better quality images.
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')

Good to know the versions!
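If you want to print them yourself, here is a quick sketch (assuming the libraries above are installed):

import matplotlib
import sklearn
print("pandas:", pd.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", sns.__version__)
print("scikit-learn:", sklearn.__version__)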

The dataset

Let’s read the multi-cluster dataset from GitHub.

multi_cluster_url = """https://raw.githubusercontent.com/junaidqazi/\
DataSets_Practice_ScienceAcademy/master/multi_cluster_data.csv"""
df = pd.read_csv(multi_cluster_url)
# Plotting data
plt.figure(figsize=(14, 6)); plt.scatter(df.a, df.b, c='black', alpha=0.3)
plt.title("Multi-cluster dataset"); plt.xticks([]); plt.yticks([]);
So, we have multiple clusters of different shapes in our dataset, along with some scattered points all around!

We can see that the data has five clusters and some scattered data points all around, as shown in the scatter plot above.

Feature Scaling

Feature scaling is always recommended, so let’s get it done! Recall the importance of feature scaling in distance-based algorithms, especially in KNN.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Separating features
X = df.drop('t', axis=1)
cols = X.columns
X = StandardScaler().fit_transform(X)
X = pd.DataFrame(X, columns=cols)

So, we have scaled features in X now!

Four clustering techniques — k-means, agglomerative, DBSCAN and OPTICS

If you want to understand how these clustering techniques work in detail, revise this lecture: A45: Clustering — Unsupervised Machine Learning.

Let’s train the respective instances for the four selected clustering algorithms and plot their results side-by-side.

from sklearn import cluster
# Creating model instances and training on the data (KMeans, agglomerative, OPTICS and DBSCAN)
# Fitting k-means with default parameters other than n_clusters
kmeans = cluster.KMeans(n_clusters=5).fit(X)
# Fitting agglomerative with default parameters other than n_clusters
agglm = cluster.AgglomerativeClustering(n_clusters=5).fit(X)
# Fitting DBSCAN -- Remember, we don't need n_clusters for dbscan
dbscan = cluster.DBSCAN().fit(X)
# Fitting OPTICS -- Remember, we don't need n_clusters for optics
optic = cluster.OPTICS().fit(X)
# Plotting results
f, ax = plt.subplots(nrows=2, ncols=2, sharey=True, figsize=(16, 8))
ax[0][0].set_title('K Means (k=5) -- Partitioning clustering')
ax[0][0].scatter(df.a, df.b, c=kmeans.labels_, cmap='rainbow')
ax[0][1].set_title("Agglomerative (k=5) -- Hierarchical clustering")
ax[0][1].scatter(df.a, df.b, c=agglm.labels_, cmap='rainbow')
ax[1][0].set_title('DBSCAN with default parameters')
ax[1][0].scatter(df.a, df.b, c=dbscan.labels_, cmap='rainbow')
ax[1][1].set_title('OPTICS with default parameters')
ax[1][1].scatter(df.a, df.b, c=optic.labels_, cmap='rainbow')
ax[0][0].set_xticks([]); ax[0][0].set_yticks([]); ax[0][1].set_xticks([]); ax[0][1].set_yticks([])
ax[1][0].set_xticks([]); ax[1][0].set_yticks([]); ax[1][1].set_xticks([]); ax[1][1].set_yticks([])
plt.tight_layout()
So, all algorithms fail to give appropriate clusters in the dataset! What is going wrong, at least with the density-based techniques?

All four algorithms failed to cluster the data properly. Indeed, for k-means and agglomerative clustering this is not surprising; they struggle with arbitrarily shaped clusters.

  • At least DBSCAN and/or OPTICS should work, right? However, they are not working either!

Well, we know that with default parameters DBSCAN can have a problem detecting meaningful clusters. Let’s try OPTICS and DBSCAN again with an appropriate value of epsilon!

epsilon — an important parameter for DBSCAN and OPTICS

We may want to find a suitable value of epsilon: the maximum distance between two samples for one to be considered as in the neighborhood of the other.

optic = cluster.OPTICS(cluster_method='dbscan').fit(X)
# try cluster_method='xi' -- xi-steep method, <shift+tab> for documentation!
print("optic is our fitted clustering instance")
optic is our fitted clustering instance

optic is our fitted clustering model on the data. We can use the following attributes to get the labels for a range of epsilon values using a simple for loop.

  • optic.reachability_ is an array of the reachability distance for each data point calculated by OPTICS
  • optic.core_distances_ is an array of the distances at which data points become core points
  • optic.ordering_ is an array of OPTICS ordered point indices

Please explore the documentation to learn more.

Moving forward, let’s get the labels assigned by the algorithm for different values of epsilon using a simple for loop.
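Here is a minimal sketch of such a loop using scikit-learn’s cluster_optics_dbscan helper (the grid of epsilon values below is an illustrative assumption, not from the original):

import numpy as np
from sklearn.cluster import cluster_optics_dbscan
# Extracting DBSCAN-style labels from the fitted OPTICS instance
# for a range of epsilon values; label -1 marks noise points
for eps in [0.05, 0.1, 0.15, 0.2, 0.25, 0.3]:
    labels = cluster_optics_dbscan(reachability=optic.reachability_,
                                   core_distances=optic.core_distances_,
                                   ordering=optic.ordering_, eps=eps)
    print("eps =", eps, "==> unique labels:", np.unique(labels))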

Finding an appropriate epsilon is very important!

So, epsilon values 0.1 and 0.15 give five clusters and mark some noise (-1). Let's fit DBSCAN and OPTICS instances using these values and plot the clusters again.
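A sketch of the refit and plots, assuming eps=0.1 (0.15 should behave similarly):

# Refitting DBSCAN and OPTICS with the chosen epsilon
dbscan = cluster.DBSCAN(eps=0.1).fit(X)
optic = cluster.OPTICS(cluster_method='dbscan', eps=0.1).fit(X)
# Plotting the new cluster assignments side by side
f, ax = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(16, 5))
ax[0].set_title('DBSCAN (eps=0.1)')
ax[0].scatter(df.a, df.b, c=dbscan.labels_, cmap='rainbow')
ax[1].set_title('OPTICS (eps=0.1)')
ax[1].scatter(df.a, df.b, c=optic.labels_, cmap='rainbow')
ax[0].set_xticks([]); ax[0].set_yticks([]); ax[1].set_xticks([]); ax[1].set_yticks([])
plt.tight_layout()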

This time, with an appropriate value of epsilon, the density-based algorithms cluster the data just fine! We can also see the noise detected in the dataset by the algorithms!

Looks good now!

So, with an appropriate value of epsilon, we can see that both DBSCAN and OPTICS are working great.

Please explore the rich documentation of the available clustering algorithms on scikit-learn; there are several important parameters that can help to improve your final algorithm. This link from scikit-learn’s documentation is also helpful to explore clustering.

In the previous lecture, we learned about the silhouette coefficient to evaluate clustering algorithms. The standard formulation of the silhouette coefficient is not adequate when a density-based clustering technique is in use, and this article proposes a suitable modification. Still, I would suggest you use the code from the previous lecture and get the silhouette coefficient for the algorithms trained in this lecture to learn more.
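As a starting point, here is a minimal sketch for the DBSCAN labels (excluding the noise points, label -1, is one common but debatable choice and an assumption here; the previous lecture’s code may differ):

from sklearn.metrics import silhouette_score, silhouette_samples
# Keeping only the points DBSCAN did not mark as noise (-1)
mask = dbscan.labels_ != -1
print("Mean silhouette (noise excluded):",
      silhouette_score(X[mask], dbscan.labels_[mask]))
# Per-sample silhouette coefficients for a closer look
sample_silhouettes = silhouette_samples(X[mask], dbscan.labels_[mask])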

*******************************************************************

For your information, the plot below is for DBSCAN from this lecture. You can see the cluster in blue is identified correctly; however, the silhouette coefficient for all its individual samples is less than 0 (negative). Remember, knowing the data is very important; evaluation metrics alone are not always sufficient!


Good luck and Keep practicing!

*******************************************************************

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts. ✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A48:…………….”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************


Junaid Qazi, PhD

We offer professional development, corporate training, consulting, curriculum and content development in Data Science, Machine Learning and Blockchain.