A comparative study between Partition Based Clustering and Hierarchical Clustering.

Sanyamjain
5 min read · Nov 4, 2020


So what exactly is clustering?

You must have heard that one bad fish spoils the whole pond. Clustering works on much the same idea: we separate the bad fish from the good ones based on certain attributes such as color, size, or smell.

There are various types of clustering concepts, but for the sake of simplicity we’re gonna be discussing two of them:

Partition Based Clustering

And

Hierarchical Clustering

Let’s dive into some examples for a better understanding.

Here we’ll be taking a not-so-new dataset, the breast cancer detection one that ships with scikit-learn.

Starting with loading our favorite libraries:

import numpy as np
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
%matplotlib inline

And loading the dataset:

data = load_breast_cancer()
X = data.data
y = data.target

Partition Based Clustering:

We’re gonna be using KMeans clustering to demonstrate this concept since it’s the most popular one.

This algorithm is a centroid based clustering algorithm, which starts with some assumption about the total clusters in the data and with random centers assigned to each of the clusters.

It then reassigns each data point to the center closest to it, using Euclidean distance as the distance metric.

After each reassignment, it recalculates the center of that cluster and this process repeats until reassignment of data points doesn’t change the cluster centers.

Hence a new data instance to be predicted is assigned the category of the centroid closest to it.
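To make these steps concrete, here’s a minimal NumPy sketch of the loop described above. It’s just a toy illustration under the assumptions of Euclidean distance and random initialization from the data points themselves; the scikit-learn implementation we use next does the same job far more robustly.

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Start with k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        cluster_ids = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its assigned points
        #    (ignoring the rare empty-cluster case for simplicity)
        new_centers = np.array([X[cluster_ids == j].mean(axis=0) for j in range(k)])
        # 4. Stop once reassignment no longer moves the centers
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return cluster_ids, centers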

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)  # we assume two clusters: malignant vs. benign
km.fit(X)

labels = km.labels_            # cluster assignment for each data point
centers = km.cluster_centers_  # final centroid of each cluster

Note that here we chose the number of clusters to be 2. We’re a bit lucky, since we already know the number of classes we’re dealing with, which is not the case in a real-world scenario. In that case we ought to use techniques like the elbow method and the silhouette score, as sketched below.
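As a rough sketch of how that might look (the range of candidate values for k is my own assumption, purely for illustration):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = range(2, 11)
inertias, sil_scores = [], []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)                        # for the elbow plot
    sil_scores.append(silhouette_score(X, model.labels_))  # higher is better

# Elbow method: look for the "bend" where inertia stops dropping sharply
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()

# Silhouette score: prefer the k with the highest score
plt.plot(list(k_values), sil_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.show()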

The first few labels obtained look like this:

[0 0 0 1 0 1 0 1 1 1]

Since the dataset has 30 features, it’ll be easier to visualize if we condense it down to, say, two dimensions. And when it comes to dimensionality reduction, the first algorithm that comes to mind is definitely PCA.
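Here’s a sketch of how that projection and plot might be produced; the plotting choices (colormap, marker sizes) are my own assumptions:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)          # project the 30 features down to 2
centers_pca = pca.transform(centers)  # project the K-means centers the same way

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=10)
plt.scatter(centers_pca[:, 0], centers_pca[:, 1], c='red', marker='x', s=100)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('K-means clusters in PCA-projected space')
plt.show()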

So here are the resulting clusters as seen through PCA.

As you can see, condensing the various features this way lets the clusters formed by the algorithm stand out. But the result isn’t perfect: some points still overlap or appear misclassified.

This might be because we fixed the number of clusters ourselves; the reality is that in a real-world scenario we do not have labelled data against which to compare the algorithm’s output.

Hence, one of the caveats of the K-means algorithm is that it is best used when we already have some idea about the total number of clusters that may exist in the data.

Hierarchical Clustering

Let’s see how the same problem plays out with the hierarchical clustering concept.

In scikit-learn we have a multitude of interfaces, like the AgglomerativeClustering class, to perform hierarchical clustering. Agglomerative clustering is hierarchical clustering using a bottom-up approach, i.e. each observation starts in its own cluster and clusters are successively merged together. The merge criterion is chosen from a candidate set of linkages, and the selected linkage governs the merge strategy. Some examples of linkage criteria are Ward, complete linkage, average linkage, and so on.
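As a minimal sketch of that scikit-learn interface (the choice of two clusters and Ward linkage here is just for illustration), before we build the full hierarchy with SciPy below:

from sklearn.cluster import AgglomerativeClustering

# Bottom-up merging with Ward linkage until only two clusters remain
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
agg_labels = agg.fit_predict(X)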

from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np

np.set_printoptions(suppress=True)

Z = linkage(X, 'ward')

If you have n data points, the linkage matrix Z will have a shape of (n - 1) x 4, where Z[i] tells us which clusters were merged at the i-th iteration.

Each row has four elements: the first two are either data point identifiers or cluster labels (in the later parts of the matrix, once multiple data points have been merged), the third is the distance between those two clusters (or data points), and the last is the total number of elements/data points in the newly formed cluster once the merge is complete.
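You can see this structure for yourself by printing a few rows of Z (the actual numbers will of course depend on the data):

print(Z.shape)   # (n - 1, 4): one row per merge
print(Z[:5])     # earliest merges: small distances, clusters of size 2
print(Z[-5:])    # final merges: large distances, clusters covering most points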

I’ll be using a dendrogram to plot this hierarchy so we can visualize it.
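A sketch of the plotting code; the figure size and labels are my own choices:

plt.figure(figsize=(12, 6))
dendrogram(Z)  # one leaf per data point, merges drawn bottom-up
plt.title('Hierarchical clustering dendrogram (Ward linkage)')
plt.xlabel('Data points')
plt.ylabel('Distance')
plt.show()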

It might feel overwhelming at first, so let’s try to understand it.

In the dendrogram, we can see how each data point starts as an individual cluster and slowly gets merged with other data points to form bigger clusters. On a high level, from the colors and the shape of the dendrogram, you can see that the model has correctly identified two major clusters if you consider a distance threshold of around 10,000 or above. Leveraging this distance, we can get the cluster labels using the following code.

from scipy.cluster.hierarchy import fcluster

max_dist = 10000
hc_labels = fcluster(Z, max_dist, criterion='distance')
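To compare this result against K-means, we can reuse the X_pca projection from the earlier PCA sketch and simply color the points by the hierarchical labels instead (again, only a plotting sketch):

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=hc_labels, cmap='viridis', s=10)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Hierarchical clusters in PCA-projected space')
plt.show()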

Conclusion:

We definitely see two distinct clusters, but there is more overlap between them than with the K-means method, and we have more mislabeled instances. However, do take note of the label values: here we have 1 and 2. This is just to reinforce the fact that label values only distinguish the clusters and don’t mean anything by themselves. The advantage of this method is that you do not need to input the number of clusters beforehand; you instead inspect the dendrogram (or choose a distance threshold) and let the underlying data suggest them.

Lastly, I want to give credit to the amazing book Practical Machine Learning with Python, which I’ve been reading. All of the above knowledge and observations are inspired by it, and I’m just a medium sharing it with you all.

Written by Sanyamjain

Data Scientist at Mapmyindia | LinkedIn: https://www.linkedin.com/in/sanyam-jain-4727b3151
