A45: Clustering — Unsupervised Machine Learning

k-means, k-medoids, Agglomerative — Hierarchical clustering, Dendrogram, DBSCAN, OPTICS

Junaid Qazi, PhD
15 min read · May 20, 2022

This article is a part of "Data Science from Scratch — Can I to I Can", A Lecture Notes Book Series. (Click here to get your copy today!)

Click here for the previous article/lecture on “A44: Support Vector Machines (SVMs) vs Logistic Regression — Practice & Comparisons [complete project with code]”

💐Click here to FOLLOW ME for new content💐

⚠️ In this lecture, we will work with synthetic data to understand how the well-known k-means clustering algorithm works. We will then use the seed dataset to explore a range of clustering algorithms including k-means, k-medoids, agglomerative clustering, DBSCAN, OPTICS and much more!

✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article. Doing is learning, and yes, "PLEASE read the comments, they are very useful!"

🧘🏻‍♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻‍♂️ 👉🎯

Welcome back guys!

Let's move on and learn key unsupervised algorithms. This article covers the following topics:

1. Clustering algorithms
2. Visualizing basic k-means clustering algorithm
3. The Seed dataset
4. K-means clustering — algorithm implementation
5. Finding the best value for k — the elbow method
6. K-medoids clustering — algorithm implementation
7. A quick comparison between k-means and k-medoids
8. Agglomerative — Hierarchical clustering algorithm
9. Dendrogram — Hierarchical clustering
10. DBSCAN — Density Based Spatial Clustering of Applications with Noise
11. OPTICS — Ordering Points To Identify the Clustering Structure
12. To do and readings — clustering

1. Clustering algorithms

Clustering is a set of unsupervised machine learning techniques that are employed to partition or cluster similar data points together into groups. As we don't have the expected output (the target) for each observation in unsupervised learning problems, the algorithm itself figures out these groups based on similarity among the observations.

Clustering helps to identify meaningful and/or useful structure in the data, e.g.:

  • In the medical field, the results could help to identify and group patients based on their response to a certain medical treatment.
  • Customer behavioral analysis and segmentation is extremely important in business for targeted marketing, and it can be achieved using clustering techniques. A common example is grouping customers based on their purchase history and advertising specific products to a specific group to achieve a high return at low expense.

It's important for any business to know its customers; how they behave is key to success. Businesses want to know their customers' interests, what time they visit the store, what they explore, their purchasing power and much more!

  • Can you imagine how important your credit card history could be?
  • Do you expect Apple to invest in opening stores in Africa?
  • What about a luxury car dealership in low-income countries?

There are several approaches to perform clustering; however, partitioning, hierarchical and density-based clustering are the most widely used. Each of these approaches has its own unique strengths and weaknesses.

The partitioning approach involves splitting the data into 'k' clusters and evaluating them by some criterion, e.g. minimizing the squared error. The data is iteratively re-allocated to get better clusters. k-means is the standard method in this category; however, k-medoids and CLARANS (Clustering Large Applications based on RANdomized Search) are also used in several cases.

Hierarchical clustering is another widely used approach that builds nested clusters by merging or splitting them successively. A tree or dendrogram is used to represent this hierarchy, which shows the relationship between the observations allocated to the clusters. The root represents all observations as a single cluster, whereas a leaf represents only one sample. Generally, the technique follows either a bottom-up (agglomerative) or top-down (divisive) strategy.

  • In agglomerative, each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • In divisive, all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

The density-based (DBSCAN) clustering algorithm was proposed in 1996 (original paper). This non-parametric algorithm can discover clusters of arbitrary shape while separating regions based on the densities of observations in the dataset. We don't need to specify the number of clusters, and it can find clusters of any shape in the data. The algorithm is also helpful in identifying noise or outliers. (DBSCAN — Density-Based Spatial Clustering of Applications with Noise)

Remember, without constraints or a controlled experiment, it is not trivial to train the best clustering algorithm for the data, and there is no single best clustering algorithm.

Libraries to implement clustering algorithms — good to know and explore:

  • scikit-learn implements several commonly used clustering algorithms, including k-means, hierarchical clustering and DBSCAN, for the Python data science ecosystem.
  • scikit-learn-extra is an extension of scikit-learn; k-medoids can be implemented using this library.
  • PyClustering is another useful library for data mining that implements CLARANS and several other clustering algorithms.

Let's move on and learn by doing. First things first, we need to import the required libraries.
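A minimal set of imports for this project could look like the sketch below (the exact imports used in the original notebook are not shown, so treat this list as an assumption):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, OPTICS
# k-medoids lives in the scikit-learn-extra package (pip install scikit-learn-extra)
from sklearn_extra.cluster import KMedoids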

2. Visualizing basic k-means clustering algorithm

Working of k-means
So, after a comprehensive overview, let's move on and see how these clustering algorithms work. Perhaps starting with the simplest and most widely used k-means clustering would be the best start.

We can generate data and manually implement k-means (as an example of a clustering algorithm) for ten iterations using a for loop and observe each step.
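A sketch of such a manual implementation is given below, assuming synthetic data from make_blobs; the plotting utilities used for the original figures are not shown in this article, so the code here is illustrative only.

from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: 300 points around 3 centres (illustrative settings).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

rng = np.random.default_rng(0)
k = 3
# Step 1: randomly pick k data points as the initial centroids.
centroids = X_demo[rng.choice(len(X_demo), size=k, replace=False)]

for iteration in range(10):
    # Step 2: assign each point to its nearest centroid.
    distances = np.linalg.norm(X_demo[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its assigned points.
    centroids = np.array([X_demo[labels == j].mean(axis=0) for j in range(k)])

plt.scatter(X_demo[:, 0], X_demo[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=100)
plt.title('Manual k-means after 10 iterations')
plt.show()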

Synthetic data using make_blobs and some utility functions for the figures below!
Code for the first 5 iterations in the figure below.
Code to generate iterations 6 to 10 in the figure below.
In blue, we have the data and three randomly initialized centroids; after 10 iterations, the k-means clustering algorithm has assigned the data points to their respective cluster centroids.

In the figure above, we generated data as shown in the first scatter plot (in blue).

  • Let's start with k=3; the first step is to randomly initialize k centroids, as shown in the first figure. In each iteration, we compute new centroids and re-assign the data points until there is no change in the centroids; once that is achieved, the clusters are the final clusters from the algorithm. (In the example above, we can see there is no change in clusters after the 9th iteration -- notice the white arrows!)

3. The Seed dataset

Wheat seed dataset

K-means clustering, one of the oldest and most widely used unsupervised learning algorithms, is reasonably straightforward to implement in Python using scikit-learn. Let's work with a real dataset and cluster wheat seed varieties based on their geometrical properties. The features are:

👉 A: Area
👉 P: Perimeter
👉 C: Compactness {C = 4πA / P²}
👉 LK: Length of Kernel
👉 WK: Width of Kernel
👉 A_Coef: Asymmetry Coefficient
👉 LKG: Length of Kernel Groove

👉 The seed dataset is available at the UCI Machine Learning Repository. A soft X-ray technique and the GRAINS package were used to construct all seven real-valued attributes in the dataset.

👉 Please note, originally this dataset is a supervised learning classification problem, as the target class is given. Remember, the target class is unknown for an unsupervised problem. We can ignore the target class and treat this dataset as an unsupervised problem in the project. However, the given target class will be helpful to see the difference. >> target: target class (0, 1, 2) — good for learning purposes to compare the outcomes, however, not required for clustering — <<

Let's move on and read the dataset from the GitHub repository!
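A minimal way to load and inspect the data is sketched below; the URL is a placeholder for the actual course repository link, which is not reproduced here.

import pandas as pd

# Placeholder URL -- replace with the actual GitHub raw link used in the course.
url = "https://raw.githubusercontent.com/<user>/<repo>/main/seeds.csv"
df = pd.read_csv(url)

df.info()    # expect 210 observations, all numeric, no missing values
df.head()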

There is no missing data, and we have 210 observations, all numeric.

Well, it’s good to have readable names for the columns. We can rename them, right?

Renaming the columns for better understanding and readability.
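One possible renaming, assuming the short column names used later in this article (area, len_kernel_groove, asymmetry_coeff, etc.); adjust the list to match your file's original headers.

# Assumed readable names for the seven features plus the target column.
df.columns = ['area', 'perimeter', 'compactness', 'len_kernel',
              'width_kernel', 'asymmetry_coeff', 'len_kernel_groove', 'target']
df.head()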

It looks nice as all the feature names are readable now!

Feature scaling

Let’s scale the features first!

We are using MinMaxScaler, try using StandardScaler and compare!
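A sketch of the scaling step with MinMaxScaler (the target column is set aside, since it is not used for clustering):

from sklearn.preprocessing import MinMaxScaler

features = df.drop('target', axis=1)               # keep the target aside
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
X.describe()                                        # every feature now lies in [0, 1]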

Grain size is usually helpful to categorize the seed: the bigger the size, the bigger the area and perimeter will be. Makes sense, right?

Let's have a quick look at Area against Length of kernel groove and Asymmetry coefficient; we might get some insights!

The plot on the left suggests two clusters in the data; what about the plot on the right?

Well, the above scatter plots could be helpful! The plot on the left suggests that there are two clusters in the data, whereas the one on the right looks very scattered.

Having said that, the target could be helpful in this learning project; you can see the actual distribution of clusters using the code given below!
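The plotting code is not reproduced in this article; a minimal sketch that colours the two scatter plots by the known target class could look like this:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Colour by the known target class -- for learning/comparison purposes only.
axes[0].scatter(X['len_kernel_groove'], X['area'], c=df['target'], cmap='viridis', s=25)
axes[0].set_xlabel('len_kernel_groove')
axes[0].set_ylabel('area')

axes[1].scatter(X['asymmetry_coeff'], X['area'], c=df['target'], cmap='viridis', s=25)
axes[1].set_xlabel('asymmetry_coeff')
axes[1].set_ylabel('area')

plt.suptitle('Actual target classes (for reference only)')
plt.show()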

REMEMBER in unsupervised learning, we don’t have target values, we cluster or group the data based on their similarities!

4. K-means clustering — algorithm implementation

Let's move on and train our k-means clustering model, it's super easy!

Let’s assume there are two clusters in the data and use k=2.
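Training the model takes a couple of lines with scikit-learn; a sketch, assuming the scaled feature frame X from the scaling step above:

from sklearn.cluster import KMeans

# Assume k=2 for now and fit on the scaled features.
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
print(kmeans.cluster_centers_.shape)   # (2, 7): one centroid per cluster, in 7-D feature space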

Let’s plot the clusters, we can write a utility function for this purpose that can be re-used later in the project (code reusability).
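A possible version of such a utility (the function name and defaults are illustrative, not the original code):

import matplotlib.pyplot as plt

def plot_clusters(X, labels, x_col='len_kernel_groove', y_col='area', title=''):
    # Scatter two selected features, coloured by the assigned cluster label.
    plt.scatter(X[x_col], X[y_col], c=labels, cmap='viridis', s=25)
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.title(title)
    plt.show()

plot_clusters(X, kmeans.labels_, title='k-means clusters with k=2')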

So, with k=2, these are the outcomes of the k-means clustering algorithm. Remember, our assumption of k=2 may not be the right choice; we need to find the best value for k!

Well, a good idea is to find the optimum value for k using the elbow method (recall it from the KNN algorithm).

5. Finding best value for k — The elbow method

We are familiar with the elbow method from the KNN lecture; let's use it to find the value of k for our clustering algorithm!
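A typical elbow-method loop records the model inertia (within-cluster sum of squared distances) for a range of k values; a sketch, assuming the imports and scaled frame X from above:

# Inertia for k = 1..10; the 'elbow' in the curve suggests a good k.
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow method for k-means')
plt.show()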

The elbow method is useful and suggests the best value of k; it looks like k=3 is a good choice for the data in use!

From the plot above, it looks like k=3 is a better option. Let's re-train our k-means clustering algorithm with k=3 and see how the clusters are distributed.
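Re-training with k=3 and re-using the plot_clusters helper sketched earlier:

kmeans3 = KMeans(n_clusters=3, random_state=42)
kmeans3.fit(X)
plot_clusters(X, kmeans3.labels_, title='k-means clusters with k=3')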

Clusters for k=3, the value of k suggested by the elbow method (see above).

In the above example, our k-means clustering algorithm is trained on seven features, and looking at 2-D plots for selected features may not be very helpful. In the plot on the left (Area vs Length of kernel groove), the overlapping points could be an illusion; they may have a clear separation in higher dimensions. We can think about 3-D visualization.

Let's use plotly to get an interactive 3-D plot for the same features as in the above two plots. Want to refresh plotly? Please explore Section-7 in Book-1.
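An interactive 3-D scatter with plotly could look like the sketch below (the pattern mirrors the commented-out plotly code shown later in this article):

kmeans3_labels = pd.Series(kmeans3.labels_, dtype=str)   # string labels give discrete colours
fig = px.scatter_3d(data_frame=X, x='len_kernel_groove', y='area',
                    z='asymmetry_coeff', color=kmeans3_labels)
fig.show()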

In the above 3-D plot, it looks like our k-means clustering algorithm gives us reasonable clusters.

Moving forward, let's try k-medoids, the sister algorithm of k-means clustering.

6. K-medoids clustering — algorithm implementation

The k-medoids clustering algorithm is related to the k-means algorithm; however, medoids are always restricted to be members of the dataset, whereas centroids are not necessarily members of the dataset. Medoids are most commonly used on data where a mean or centroid cannot be defined, such as graphs.

Being restricted to be members of the dataset, medoids are the representative data points/objects of a dataset (or a cluster) whose average dissimilarity to all the data points/objects in the cluster is minimal.

Let's quickly get the clustering from the k-medoids algorithm!

👉 Please note, the k-medoids algorithm is not a part of the regular scikit-learn library; however, it is available in its extension library scikit-learn-extra. You may want to install this extension.
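A sketch of the k-medoids fit with scikit-learn-extra, assuming the same scaled features X and k=3:

from sklearn_extra.cluster import KMedoids   # pip install scikit-learn-extra

kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids.fit(X)

# Medoids are actual rows of the dataset; these are their indices.
print(kmedoids.medoid_indices_)
plot_clusters(X, kmedoids.labels_, title='k-medoids clusters with k=3')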

The bigger circles for each cluster are the selected medoids that are the actual data points.

Notice that, in k-medoids algorithm, the medoids are actual data points (bigger circles with medoid datapoints in their centers)

#Try yourself -- Code for the interactive plot!
#kmedoids_labels=pd.Series(kmedoids.labels_,dtype=str) #converting to string to see the groups
#fig=px.scatter_3d(data_frame=X,x="len_kernel_groove",y="area",z="asymmetry_coeff",color=kmedoids_labels)
#fig.show()

==> IMPORTANT >> Remember, we have no targets in unsupervised machine learning, hence there is no question of a confusion matrix. In this learning project we do have targets, and many aspiring data scientists have asked about comparing the model results with the available targets. They may have thought that we should be able to get a confusion matrix and classification report using the targets and the results from the clustering algorithms, at least in this project. Well, this could be very misleading. It's unsupervised learning, and our model may have clustered the data correctly while giving the clusters different labels (e.g. all 1's in the true labels may have been labelled 0's and vice versa). We can't even compare the labels between two clustering algorithms directly; we need to look at whether both algorithms are putting the same points in the same clusters, irrespective of the number/label they assign to each cluster.

Let’s look at the true labels and the assigned labels just for fun!
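A minimal way to do this comparison, assuming the fitted kmeans3 and kmedoids models from above (the exact code used for the original output is not shown):

# Select the observations whose true class is 0 (70 rows) and compare the
# cluster labels assigned by the two algorithms; the labels themselves are arbitrary.
mask = (df['target'] == 0).values
comparison = pd.DataFrame({
    'kmeans_label': kmeans3.labels_[mask],
    'kmedoids_label': kmedoids.labels_[mask],
})
print(comparison.value_counts())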

In the example above, all 70 observations actually belong to the same cluster (the actual target class is 0, as selected in the code above). We can see that both clustering algorithms (k-means and k-medoids) put the majority of the data points in a single cluster. Cluster labels are arbitrary and could be different, which does not mean they are wrong. The question is: are both algorithms grouping the same data points into the same clusters?

I hope the point is clear, and if you still want to see the confusion matrix and classification report, you can try it yourself (NOT A GOOD IDEA AT ALL).

7. A quick comparison between k-means and k-medoids

  • k-means and k-medoids are both partitional algorithms and break the dataset into groups. Both algorithms attempt to minimize the distance between points labeled to be in a cluster and a point designated as the centre (centroid or medoid) of that cluster.
  • k-means attempts to minimize the total squared error, while k-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the centre of that cluster.
  • In contrast to the k-means algorithm, k-medoids chooses data points as centres and can be used with arbitrary distances, while in k-means the centre of a cluster is not necessarily one of the input data points.

Fuzzy clustering is another clustering approach, also known as soft clustering or the soft k-means method. In contrast to traditional k-means, each data point in fuzzy clustering can belong to more than one cluster: every data point in the dataset gets a membership coefficient for every cluster. (Image: segmentation by fuzzy clustering.)

8. Agglomerative — Hierarchical clustering algorithm

Agglomerative clustering, another widely used unsupervised machine learning algorithm, is the bottom-up approach to hierarchical clustering.

Let's use the same seed dataset and see how this algorithm works. We are using k=3, as suggested by the elbow method.
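A sketch of the fit with scikit-learn's AgglomerativeClustering:

from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3)
agg.fit(X)

print('n_leaves:  ', agg.n_leaves_)     # 210 -- one leaf per observation
print('n_clusters:', agg.n_clusters_)   # 3  -- as requested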

So, we have 210 observations or data points in total; as a bottom-up approach, the seeding starts from each observation as its own cluster (n_leaves=210) in the tree (dendrogram), and the observations are merged into three final clusters (as we set n_clusters=3).

Let's visualize (in 3-D) how the final clusters from the agglomerative clustering algorithm look.

3-D visualization of the clusters created using the agglomerative clustering algorithm

9. Dendrogram — Hierarchical clustering

If we are interested in getting the dendrogram showing the hierarchy and the relationship between the observations allocated to the clusters, the code below will be helpful.

Let's write a function to plot the dendrogram; we can call this function anytime to get our purpose done.
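The helper below follows the approach used in the scikit-learn documentation: it rebuilds a SciPy linkage matrix from a fitted AgglomerativeClustering model and hands it to scipy's dendrogram function (a sketch, not necessarily the original code).

import numpy as np
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Count the number of samples under each node of the merge tree.
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1                    # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    # Linkage matrix: [child_1, child_2, distance, sample count] per merge.
    linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)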

Please note, we need to set n_clusters=None and distance_threshold=0 while creating the model instance for this purpose.

Let's retrain the AgglomerativeClustering model with these parameters and plot the dendrogram.
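For example (truncating the tree to its top levels so the plot stays readable; the truncation settings are illustrative):

import matplotlib.pyplot as plt

# distance_threshold=0 with n_clusters=None makes the model build the full tree.
agglm = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)

plot_dendrogram(agglm, truncate_mode='level', p=4)
plt.title('Hierarchical clustering dendrogram (seed data)')
plt.xlabel('Number of points in node (or sample index if no parenthesis)')
plt.show()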

I encourage you to explore the documentation; there are several parameters that you may want to learn and understand. Try the code below and see what is going on.

#distance_threshold is the linkage distance threshold above which clusters will not be merged.
#agglm=AgglomerativeClustering(n_clusters=None,distance_threshold=0).fit(X)
#print("n_leaves: ",agglm.n_leaves_)
#print("n_clusters",agglm.n_clusters_)

#agglm_labels=pd.Series(agglm.labels_,dtype=str) #converting to string to see the groups
#fig=px.scatter_3d(data_frame=X,x="len_kernel_groove",y="area",z="asymmetry_coeff",color=agglm_labels)
#fig.show()

Moving forward, let's talk about the density-based clustering technique "DBSCAN".

10. DBSCAN — Density-Based Spatial Clustering of Applications with Noise

Density-based clustering can discover clusters of arbitrary shape by separating dense regions from low-density regions (we don't need to pass the number of clusters 'k' to this algorithm). The algorithm can identify outliers as well. Let's see how DBSCAN works with the same seed dataset.

The following are two very important parameters for the DBSCAN algorithm.

  • eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
  • min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

Let's import DBSCAN from scikit-learn and train the model on the seed data.
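A sketch using the eps and min_samples values discussed below, with the plot_clusters helper from earlier:

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.2, min_samples=15)
dbscan.fit(X)

# Label -1 marks the points DBSCAN considers noise/outliers.
print(pd.Series(dbscan.labels_).value_counts())
plot_clusters(X, dbscan.labels_, title='DBSCAN (eps=0.2, min_samples=15)')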

DBSCAN clustering is marking lots of data points as noise.

So, DBSCAN is marking lots of data points as noise in the above scatter plot. It might be a good idea to get a 3-D plot and once again plotly can be helpful.

In red are the data points marked as noise by DBSCAN; it's density-based clustering!

Well, DBSCAN did not really work well with this data; with eps=0.2 and min_samples=15, the algorithm thinks there is a lot of noise (cluster -1).

11. OPTICS — Ordering Points To Identify the Clustering Structure

Detection of meaningful clusters in data with varying densities (as seen above) is one of the major weaknesses of the DBSCAN algorithm. OPTICS (original paper), closely related to DBSCAN, helps to resolve this problem by finding core samples of high density and expanding clusters from them. To achieve this, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter xi.

In the algorithm, we can pass cluster_method={‘xi’,’dbscan’} to extract clusters using the calculated reachability and ordering — See the documentation for details

Moving forward, let's see how OPTICS helps to cluster the seed data! We can pass min_samples=10 in the model instance for training.

We know that eps is the maximum distance between two samples for one to be considered in the neighborhood of the other. Now, clust is our trained instance of the OPTICS model. We can use the required attributes (e.g. reachability, core distances, ordering) from our trained instance clust and loop over a range of eps values using a simple for loop, as given below. We can see how eps affects the cluster formation.
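A sketch of that loop, using scikit-learn's cluster_optics_dbscan helper to extract DBSCAN-style clusterings from the fitted OPTICS model for several eps values (the exact eps grid is an assumption):

from sklearn.cluster import OPTICS, cluster_optics_dbscan

clust = OPTICS(min_samples=10)
clust.fit(X)

for eps in [0.1, 0.2, 0.3, 0.4, 0.5]:
    labels = cluster_optics_dbscan(reachability=clust.reachability_,
                                   core_distances=clust.core_distances_,
                                   ordering=clust.ordering_, eps=eps)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f'eps={eps}: {n_clusters} clusters, {n_noise} points labelled as noise')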

eps=0.2 is giving three clusters, let's visualize in 3D using plotly!

So far, that's all for clustering. We have learned a range of unsupervised learning techniques and how they help to cluster similar data points. Not every algorithm is good for every dataset, and we need to explore the best option based on the given data; however, we should have all the possible tools in our skill set!

I suggest you practice your skills and do some extra reading, which will be very helpful (see the links below).

12. To do and readings — clustering

*******************************************************************

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts. ✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A46: Evaluating Clustering Algorithm — Unsupervised Machine Learning”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************
