A47: Clustering — A complex multi-cluster dataset
K-means, Agglomerative, Silhouette Coefficient, Silhouette Score, Unsupervised Learning, Density-based clustering, DBSCAN, OPTICS
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series (Part-3). (click here to get your copy of Part-1 & Part-2 today)
⚠️ In this lecture, we will be working with a complex multi-cluster dataset.
✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, "PLEASE read the comments, they are very useful!"
🧘🏻♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻♂️ 👉🎯
In the previous lectures (A45 and A46), we learned about a range of clustering algorithms and their evaluation using the silhouette coefficient.
Let's move on, work with a complex multi-cluster dataset, and compare the performance of different clustering algorithms. In this lecture, we will also explore how important the epsilon parameter is in density-based clustering techniques. Remember, not every algorithm is good for every dataset!
Starting with some required imports!
# Required imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.3) # setting font size for the whole notebook
sns.set_style("white") # if you want to set the style
# Setting display format to retina in matplotlib to see better quality images.
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings(‘ignore’)
Good to know the versions!
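If you want to check which versions you are running, a minimal sketch is below; the exact versions on your machine will of course differ.
# Quick check of the library versions used in this notebook
import sys, pandas, matplotlib, seaborn, sklearn
print("Python      :", sys.version.split()[0])
print("pandas      :", pandas.__version__)
print("matplotlib  :", matplotlib.__version__)
print("seaborn     :", seaborn.__version__)
print("scikit-learn:", sklearn.__version__)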
The dataset
Let's read the multi-cluster data from GitHub.
multi_cluster_url = """https://raw.githubusercontent.com/junaidqazi/\
DataSets_Practice_ScienceAcademy/master/multi_cluster_data.csv"""
df = pd.read_csv(multi_cluster_url)

# Plotting data
plt.figure(figsize=(14, 6)); plt.scatter(df.a, df.b, c='black', alpha=0.3)
plt.title("Multi-cluster dataset"); plt.xticks([]); plt.yticks([]);
As shown in the scatter plot above, the data has five clusters along with some scattered data points all around.
Feature Scaling
Feature scaling is always recommended, so let's get it done! Recall the importance of feature scaling in distance-based algorithms, typically KNN.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Separating features
X = df.drop('t', axis=1)
cols = X.columns
X = StandardScaler().fit_transform(X)
X = pd.DataFrame(X, columns=cols)
So, we have scaled features in X now!
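As an optional sanity check, the standardized features should now have a mean close to 0 and a standard deviation close to 1:
# Quick sanity check -- standardized features should have mean ~0 and std ~1
print(X.describe().loc[['mean', 'std']].round(2))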
Four clustering techniques — k-means, agglomerative, DBSCAN and OPTICS
If you want to understand how these clustering techniques work in detail, revisit lecture A45: Clustering — Unsupervised Machine Learning.
Let’s train the respective instances for the four selected clustering algorithms and plot their results side-by-side.
from sklearn import cluster

# Creating model instances and training them on the data (KMeans, Agglomerative, DBSCAN and OPTICS)

# Fitting k-means with default parameters other than n_clusters
kmeans = cluster.KMeans(n_clusters=5).fit(X)

# Fitting agglomerative with default parameters other than n_clusters
agglm = cluster.AgglomerativeClustering(n_clusters=5).fit(X)

# Fitting DBSCAN -- remember, we don't need n_clusters for DBSCAN
dbscan = cluster.DBSCAN().fit(X)

# Fitting OPTICS -- remember, we don't need n_clusters for OPTICS
optic = cluster.OPTICS().fit(X)

# Plotting results
f, ax = plt.subplots(nrows=2, ncols=2, sharey=True, figsize=(16, 8))
ax[0][0].set_title('K-Means (k=5) — Partitioning clustering')
ax[0][0].scatter(df.a, df.b, c=kmeans.labels_, cmap='rainbow')
ax[0][1].set_title('Agglomerative (k=5) — Hierarchical clustering')
ax[0][1].scatter(df.a, df.b, c=agglm.labels_, cmap='rainbow')
ax[1][0].set_title('DBSCAN with default parameters')
ax[1][0].scatter(df.a, df.b, c=dbscan.labels_, cmap='rainbow')
ax[1][1].set_title('OPTICS with default parameters')
ax[1][1].scatter(df.a, df.b, c=optic.labels_, cmap='rainbow')
for a in ax.flatten():  # removing ticks from all subplots
    a.set_xticks([]); a.set_yticks([])
plt.tight_layout()
All four algorithms failed to cluster the data. Indeed, this kind of problem is not a good fit for k-means and agglomerative clustering. At least DBSCAN and/or OPTICS should work, right? However, with default parameters, they are not working either!
Well, we can see that DBSCAN (with default parameters) has a problem detecting meaningful clusters here. Let's try DBSCAN and OPTICS again with an appropriate value of epsilon!
epsilon — an important parameter for DBSCAN and OPTICS
We first need to find a suitable value of epsilon, the maximum distance between two samples for one to be considered as in the neighborhood of the other.
optic = cluster.OPTICS(cluster_method='dbscan').fit(X)
# try cluster_method='xi' -- the xi-steep method; press <shift+tab> for documentation!
print("optic is our fitted clustering instance")
optic is our fitted clustering instance
optic is our fitted clustering model on the data. We can use the following attributes to get the labels for a range of epsilon values using a simple for loop.
- optic.reachability_ is an array of the reachability distances for each data point, as calculated by OPTICS
- optic.core_distances_ is an array of the distances at which data points become core points
- optic.ordering_ is an array of the OPTICS ordered point indices
Please explore the documentation to learn more.
Moving forward, let's get the labels assigned by the algorithm for different values of epsilon using a simple for loop.
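Here is a minimal sketch of such a loop using scikit-learn's cluster_optics_dbscan helper, which extracts DBSCAN-style labels from a fitted OPTICS model; the grid of epsilon values below is just an assumption, feel free to try your own.
import numpy as np
from sklearn.cluster import cluster_optics_dbscan

# Extract DBSCAN-style labels from the fitted OPTICS model for a range of epsilon values
for eps in [0.05, 0.1, 0.15, 0.2, 0.3]:  # assumed grid -- adjust as you like
    labels = cluster_optics_dbscan(reachability=optic.reachability_,
                                   core_distances=optic.core_distances_,
                                   ordering=optic.ordering_,
                                   eps=eps)
    # -1 is noise; the remaining unique labels are the clusters found
    print(f"eps = {eps:>4} --> labels: {np.unique(labels)}")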
So, epsilon values of 0.1 and 0.15 give five clusters and mark some points as noise (-1). Let's fit DBSCAN and OPTICS instances using these values and plot the clusters again.
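A minimal sketch of the refit and the side-by-side plot; the variable names dbscan_eps and optic_eps are mine, and passing 0.1 to DBSCAN and 0.15 to OPTICS is simply one reasonable choice, not the only one.
# Refit DBSCAN and OPTICS with the epsilon values found above
dbscan_eps = cluster.DBSCAN(eps=0.1).fit(X)
optic_eps = cluster.OPTICS(cluster_method='dbscan', eps=0.15).fit(X)

# Plot the new cluster assignments side by side
f, ax = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(16, 5))
ax[0].set_title('DBSCAN (eps=0.1)')
ax[0].scatter(df.a, df.b, c=dbscan_eps.labels_, cmap='rainbow')
ax[1].set_title('OPTICS with cluster_method="dbscan" (eps=0.15)')
ax[1].scatter(df.a, df.b, c=optic_eps.labels_, cmap='rainbow')
for a in ax:  # removing ticks from both subplots
    a.set_xticks([]); a.set_yticks([])
plt.tight_layout()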
Looks good now!
So, with an appropriate value of epsilon, we can see that both DBSCAN and OPTICS are working great.
Please explore the rich documentation of the available clustering algorithms in scikit-learn; there are several important parameters that can help improve your final algorithm. This link from scikit-learn's documentation on clustering is also helpful!
In the previous lecture, we learned about the silhouette coefficient for evaluating clustering algorithms. The standard formulation of the silhouette coefficient is not adequate when a density-based clustering technique is in use, and this article proposes a suitable modification. Still, I would suggest you use the code from the previous lecture and compute the silhouette coefficient for the algorithms trained in this lecture to learn more.
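As a starting point, a minimal sketch of computing both the overall score and the per-sample values is below; it assumes the dbscan_eps model from the earlier sketch.
from sklearn.metrics import silhouette_score, silhouette_samples

# Overall silhouette coefficient for the DBSCAN labels (noise points, labelled -1, are treated as their own group here)
print("Silhouette score (DBSCAN):", round(silhouette_score(X, dbscan_eps.labels_), 3))

# Per-sample silhouette coefficients -- useful for inspecting individual clusters
sample_silhouette = silhouette_samples(X, dbscan_eps.labels_)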
*******************************************************************
For your information, the plots below are for DBSCAN from this lecture. You can see that the cluster in blue is identified correctly; however, the silhouette coefficient for all of its individual samples is less than 0 (negative). Remember, knowing the data is very important; evaluation metrics alone are not always sufficient!
Good luck and Keep practicing!
*******************************************************************
💐Click here to FOLLOW ME for new contents💐
🌹Keep practicing to brush-up & add new skills🌹
✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts. ✅🌹💐💐💐🌹✅
Good luck!
See you in the next lecture on “A48:…………….”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
- Books on leanpub
- SkillShare link (two free months for new subscribers)
- Free on YouTube
- ScienceAcademy
- https://karobarklinik.com — for all your digital needs!
**************************************************************************************************************************************
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.
**************************************************************************************************************************************