# Hierarchical Clustering - Agglomerative

Often data is produced by a process that has some natural hierarchy. If you have a clustering problem where this is true, hierarchical clustering works really well. Find out more in this Python Notebook.



Clustering is an unsupervised task. In other words, we don’t have any labels or targets. This is common when you receive questions like “what can we do with this data?” or “can you tell me the characteristics of this data?”.

There are quite a few different ways of performing clustering, but one way is to form clusters hierarchically. You can form a hierarchy in two ways: start from the top and split, or start from the bottom and merge.

In this workshop we’re going to look at the latter, which is called agglomerative clustering.

Hierarchical techniques are really useful when you can assert that there is some natural tendency to form a hierarchy in your domain. For example, if you are profiling the people that use your application, you might find that they tend to form a hierarchy.

# Usual imports
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display


## Data

We’ll be using the by now familiar iris dataset again for this, because it has a natural hierarchy.

from sklearn import datasets

# import some data to play with
iris = datasets.load_iris()
feat = iris.feature_names
X = iris.data[:, :2]  # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
y_name = ['Setosa', 'Versicolour', 'Virginica']


## Agglomerative clustering

Remember agglomerative clustering is the act of forming clusters from the bottom up.

We start with each observation as its own cluster, then iteratively merge the two closest clusters.

Eventually we end up with the requested number of clusters (which needs to be specified in advance).

Let’s stick to “ward” linkage, which merges the pair of clusters that minimises the increase in within-cluster variance; it generally works pretty well.
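The merge loop described above can be sketched in plain Python. This is a toy illustration with made-up 1-D points and single linkage (closest pair of members) rather than ward, and it is not how sklearn implements it internally:

```python
import itertools

points = [0.0, 0.1, 1.0, 1.1]
clusters = [[p] for p in points]  # start: every observation is its own cluster

def single_linkage(a, b):
    # distance between the two closest members of clusters a and b
    return min(abs(x - y) for x in a for y in b)

n_clusters = 2  # the target number of clusters, specified in advance
while len(clusters) > n_clusters:
    # find and merge the pair of clusters that are closest together
    i, j = min(itertools.combinations(range(len(clusters)), 2),
               key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(clusters)  # [[0.0, 0.1], [1.0, 1.1]]
```

Real implementations use a linkage matrix and clever data structures to avoid the quadratic distance scan, but the bottom-up merging logic is the same.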

from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
clustering.fit(X);
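As a quick standalone sanity check, we can confirm that fitting produces one label per observation. This sketch reloads iris itself so it runs on its own; `fit_predict` is simply `fit` followed by reading `labels_`:

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

# same slice as before: first two features of iris
X = datasets.load_iris().data[:, :2]
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
print(labels[:10])  # cluster assignment for the first ten observations
```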


And again, let’s plot the data.

# MinMax scale the data so that it fits nicely onto the 0.0->1.0 axes of the plot.
from sklearn import preprocessing
X_plot = preprocessing.MinMaxScaler().fit_transform(X)

colours = 'rbg'
for i in range(X.shape[0]):
    plt.text(X_plot[i, 0], X_plot[i, 1], str(clustering.labels_[i]),
             color=colours[y[i]],
             fontdict={'weight': 'bold', 'size': 9})

plt.xticks([])
plt.yticks([])
plt.axis('off')
plt.show()


In this plot, the numbers denote which cluster each observation has been assigned to.

The colours denote the original class.

I think you will agree that the clustering has done a pretty decent job, with only a few outliers.

• Try altering the number of clusters to 1, 3, others….
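One way to approach that exercise is to loop over candidate cluster counts and refit. This is a standalone sketch that reloads the data itself:

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X = datasets.load_iris().data[:, :2]
for n in (1, 2, 3, 4):
    labels = AgglomerativeClustering(n_clusters=n).fit_predict(X)
    print(n, len(set(labels)))  # the model returns exactly n clusters
```

Because the hierarchy is built the same way each time, asking for fewer clusters is equivalent to stopping the merging earlier.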

## Dendrograms

Dendrograms are hierarchical plots of clusters where the height of each bar represents the distance between the two clusters it merges.

We can lean on our other general-purpose data science library, scipy, to provide us with a method to plot dendrograms. Unfortunately we also have to use scipy's linkage methods, rather than sklearn's, because of the parameters the plotting function expects.

from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, 'ward')

figure = plt.figure(figsize=(7.5, 5))
dendrogram(
    Z,
    color_threshold=0,
)
plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('sample index')
plt.ylabel('distance')
plt.tight_layout()
plt.show()


Wooohey that’s a lot of legs. Let’s cut a few off to be able to take a better look at the data…

figure = plt.figure(figsize=(7.5, 5))
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=24,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.title('Hierarchical Clustering Dendrogram (Ward, truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
plt.show()


Much better.

Ok, now we’ve honed our artistic skills, let’s put them to the test in the whiskey data set.

## Another look at the whiskey dataset

whiskey = pd.read_csv('https://s3.eu-west-2.amazonaws.com/assets.winderresearch.com/data/whiskies.csv')
cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
        'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
X = whiskey[cols]
y = whiskey['Distillery']
display(X.head())
display(y.head())


|   | Body | Sweetness | Smoky | Medicinal | Tobacco | Honey | Spicy | Winey | Nutty | Malty | Fruity | Floral |
|---|------|-----------|-------|-----------|---------|-------|-------|-------|-------|-------|--------|--------|
| 0 | 2    | 2         | 2     | 0         | 0       | 2     | 1     | 2     | 2     | 2     | 2      | 2      |
| 1 | 3    | 3         | 1     | 0         | 0       | 4     | 3     | 2     | 2     | 3     | 3      | 2      |
| 2 | 1    | 3         | 2     | 0         | 0       | 2     | 0     | 0     | 2     | 2     | 3      | 2      |
| 3 | 4    | 1         | 4     | 4         | 0       | 0     | 2     | 0     | 1     | 2     | 1      | 0      |
| 4 | 2    | 2         | 2     | 0         | 0       | 1     | 1     | 1     | 2     | 3     | 1      | 1      |
0    Aberfeldy
1     Aberlour
2       AnCnoc
3       Ardbeg
4      Ardmore
Name: Distillery, dtype: object


• Advanced: Write an algorithm to pick a single whiskey from each main whiskey group, using only the data (truncate the data then pick observations from the AgglomerativeClustering.children_ field)
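A related (but different) route to the same goal uses scipy's `fcluster` to cut the tree at a fixed number of groups, instead of walking the `AgglomerativeClustering.children_` field. This sketch runs on synthetic stand-in data with hypothetical distillery names, so it does not need the CSV:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# stand-in for the whiskey flavour matrix: 20 observations, 12 features
X = rng.integers(0, 5, size=(20, 12))
names = [f'distillery_{i}' for i in range(20)]  # hypothetical labels

Z = linkage(X, method='ward')
groups = fcluster(Z, t=4, criterion='maxclust')  # cut into at most 4 main groups

# pick the first observation in each group as its representative
picks = {g: names[np.flatnonzero(groups == g)[0]] for g in np.unique(groups)}
print(picks)
```

Taking the first member of each group is arbitrary; picking the observation closest to the group mean would be a more principled representative.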