This is a tutorial on how to use scipy's hierarchical clustering.

One of the benefits of hierarchical clustering is that you don't need to already know the number of clusters k in your data in advance. Sadly, there doesn't seem to be much documentation on how to actually use scipy's hierarchical clustering to make an informed decision and then retrieve the clusters.

In the following I'll explain:

- how to use scipy's hierarchical clustering
- how to plot a nice dendrogram from it
- how to use the dendrogram to select a distance cut-off (aka determining the number of clusters k in your data)
- how to retrieve the k clusters
- how to visualize the clusters (2D case)

## Other works:¶

Some short shameless self-advertising:

- I teach machines to associate like humans. In that project I used hierarchical clustering to group similar learned graph patterns together.
- I'm always searching for good students, contact me.

## Naming conventions:¶

Before we start, as i know that it's easy to get lost, some naming conventions:

- X samples (n x m array), aka data points or "singleton clusters"
- n number of samples
- m number of features
- Z cluster linkage array (contains the hierarchical clustering information)
- k number of clusters

So, let's go.

## Imports and Setup¶

```
# needed imports
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
```

```
# some setting for this notebook to actually show the graphs inline
# you probably won't need this
%matplotlib inline
np.set_printoptions(precision=5, suppress=True) # suppress scientific float notation
```

## Generating Sample Data¶

You'll obviously not need this step to run the clustering if you have own data.

The only thing you need to make sure is that you convert your data into a matrix X with n samples and m features, so that X.shape == (n, m).

```
# generate two clusters: a with 100 points, b with 50:
np.random.seed(4711) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)
print X.shape # 150 samples with 2 dimensions
plt.scatter(X[:,0], X[:,1])
plt.show()
```

## Perform the Hierarchical Clustering¶

Now that we have some very simple sample data, let's do the actual clustering on it:

```
# generate the linkage matrix
Z = linkage(X, 'ward')
```

Done. That was pretty simple, wasn't it?

Well, sure it was, this is python ;), but what does the weird 'ward' mean there and how does this actually work?

As the scipy linkage docs tell us, 'ward' is one of the methods that can be used to calculate the distance between newly formed clusters. 'ward' causes linkage() to use the Ward variance minimization algorithm.

I think it's a good default choice, but it never hurts to play around with some other common linkage methods like 'single', 'complete', 'average', ... and the different distance metrics like 'euclidean' (default), 'cityblock' aka Manhattan, 'hamming', 'cosine'... if you have the feeling that your data should not just be clustered to minimize the overall intra cluster variance in euclidean space. For example, you should have such a weird feeling with long (binary) feature vectors (e.g., word-vectors in text clustering).

As you can see there's a lot of choice here and while python and scipy make it very easy to do the clustering, it's you who has to understand and make these choices. If i find the time, i might give some more practical advice about this, but for now i'd urge you to at least read up on the linked methods and metrics to make a somewhat informed choice. Another thing you can and should definitely do is check the Cophenetic Correlation Coefficient of your clustering with help of the cophenet() function. This (very very briefly) compares (correlates) the actual pairwise distances of all your samples to those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances, which in our case is pretty close:

```
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
c, coph_dists = cophenet(Z, pdist(X))
c
```

No matter what method and metric you pick, the linkage() function will use that method and metric to calculate the distances of the clusters (starting with your n individual samples (aka data points) as singleton clusters)) and in each iteration will merge the two clusters which have the smallest distance according the selected method and metric. It will return an array of length n - 1 giving you information about the n - 1 cluster merges which it needs to pairwise merge n clusters. Z[i] will tell us which clusters were merged in the i-th iteration, let's take a look at the first two points that were merged:

```
Z[0]
```

We can see that ach row of the resulting array has the format [idx1, idx2, dist, sample_count].

In its first iteration the linkage algorithm decided to merge the two clusters (original samples here) with indices 52 and 53, as they only had a distance of 0.04151. This created a cluster with a total of 2 samples.

```
Z[1]
```

In the second iteration the algorithm decided to merge the clusters (original samples here as well) with indices 14 and 79, which had a distance of 0.04914. This again formed another cluster with a total of 2 samples.

The indices of the clusters until now correspond to our samples. Remember that we had a total of 150 samples, so indices 0 to 149. Let's have a look at the first 20 iterations:

```
Z[:20]
```

We can observe that until iteration 13 the algorithm only directly merged original samples. We can also observe the monotonic increase of the distance.

In iteration 13 the algorithm decided to merge cluster indices 62 with 152. If you paid attention the 152 should astonish you as we only have original sample indices 0 to 149 for our 150 samples. All indices idx >= len(X) actually refer to the cluster formed in Z[idx - len(X)].

This means that while idx 149 corresponds to X[149] that idx 150 corresponds to the cluster formed in Z[0], idx 151 to Z[1], 152 to Z[2], ...

Hence, the merge iteration 13 merged sample 62 to our samples 33 and 68 that were previously merged in iteration 2 (152 - 2).

Let's check out the points coordinates to see if this makes sense:

```
X[[33, 68, 62]]
```

Seems pretty close, but let's plot the points again and highlight them:

```
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1]) # plot all points
plt.scatter(X[idxs,0], X[idxs,1], c='r') # plot interesting points in red again
plt.show()
```

We can see that the 3 red dots are pretty close to each other, which is a good thing.

The same happened in iteration 14 where the alrogithm merged indices 41 to 15 and 69:

```
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])
plt.scatter(X[idxs,0], X[idxs,1], c='r')
idxs = [15, 69, 41]
plt.scatter(X[idxs,0], X[idxs,1], c='y')
plt.show()
```

Showing that the 3 yellow dots are also quite close.

And so on...

We'll later come back to visualizing this, but now let's have a look at what's called a dendrogram of this hierarchical clustering first:

## Plotting a Dendrogram¶

A dendrogram is a visualization in form of a tree showing the order and distances of merges during the hierarchical clustering.

```
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
)
plt.show()
```

(right click and "View Image" to see full resolution)

If this is the first time you see a dendrogram, it's probably quite confusing, so let's take this apart...

- On the x axis you see labels. If you don't specify anything else they are the indices of your samples in X.
- On the y axis you see the distances (of the 'ward' method in our case).

Starting from each label at the bottom, you can see a vertical line up to a horizontal line. The height of that horizontal line tells you about the distance at which this label was merged into another label or cluster. You can find that other cluster by following the other vertical line down again. If you don't encounter another horizontal line, it was just merged with the other label you reach, otherwise it was merged into another cluster that was formed earlier.

Summarizing:

- horizontal lines are cluster merges
- vertical lines tell you which clusters/labels were part of merge forming that new cluster
- heights of the horizontal lines tell you about the distance that needed to be "bridged" to form the new cluster

You can also see that from distances > 25 up there's a huge jump of the distance to the final merge at a distance of approx. 180. Let's have a look at the distances of the last 4 merges:

```
Z[-4:,2]
```

Such distance jumps / gaps in the dendrogram are pretty interesting for us. They indicate that something is merged here, that maybe just shouldn't be merged. In other words: maybe the things that were merged here really don't belong to the same cluster, telling us that maybe there's just 2 clusters here.

Looking at indices in the above dendrogram also shows us that the green cluster only has indices >= 100, while the red one only has such < 100. This is a good thing as it shows that the algorithm re-discovered the two classes in our toy example.

In case you're wondering about where the colors come from, you might want to have a look at the color_threshold argument of dendrogram(), which as not specified automagically picked a distance cut-off value of 70 % of the final merge and then colored the first clusters below that in individual colors.

### Dendrogram Truncation¶

As you might have noticed, the above is pretty big for 150 samples already and you probably have way more in real scenarios, so let me spend a few seconds on highlighting some other features of the dendrogram() function:

```
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=12, # show only the last p merged clusters
show_leaf_counts=False, # otherwise numbers in brackets are counts
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True, # to get a distribution impression in truncated branches
)
plt.show()
```

The above shows a truncated dendrogram, which only shows the last p=12 out of our 149 merges.

First thing you should notice are that most labels are missing. This is because except for X[40] all other samples were already merged into clusters before the last 12 merges.

The parameter show_contracted allows us to draw black dots at the heights of those previous cluster merges, so we can still spot gaps even if we don't want to clutter the whole visualization. In our example we can see that the dots are all at pretty small distances when compared to the huge last merge at a distance of 180, telling us that we probably didn't miss much there.

As it's kind of hard to keep track of the cluster sizes just by the dots, dendrogram() will by default also print the cluster sizes in brackets () if a cluster was truncated:

```
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
Z,
truncate_mode='lastp', # show only the last p merged clusters
p=12, # show only the last p merged clusters
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True, # to get a distribution impression in truncated branches
)
plt.show()
```

We can now see that the right most cluster already consisted of 33 samples before the last 12 merges.

### Eye Candy¶

Even though this already makes for quite a nice visualization, we can pimp it even more by also annotating the distances inside the dendrogram by using some of the useful return values dendrogram():

```
def fancy_dendrogram(*args, **kwargs):
max_d = kwargs.pop('max_d', None)
if max_d and 'color_threshold' not in kwargs:
kwargs['color_threshold'] = max_d
annotate_above = kwargs.pop('annotate_above', 0)
ddata = dendrogram(*args, **kwargs)
if not kwargs.get('no_plot', False):
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
x = 0.5 * sum(i[1:3])
y = d[1]
if y > annotate_above:
plt.plot(x, y, 'o', c=c)
plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
textcoords='offset points',
va='top', ha='center')
if max_d:
plt.axhline(y=max_d, c='k')
return ddata
```

```
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10, # useful in small plots so annotations don't overlap
)
plt.show()
```

## Selecting a Distance Cut-Off aka Determining the Number of Clusters¶

As explained above already, a huge jump in distance is typically what we're interested in if we want to argue for a certain number of clusters. If you have the chance to do this manually, i'd always opt for that, as it allows you to gain some insights into your data and to perform some sanity checks on the edge cases. In our case i'd probably just say that our cut-off is 50, as the jump is pretty obvious:

```
# set cut-off to 50
max_d = 50 # max_d as in max_distance
```

Let's visualize this in the dendrogram as a cut-off line:

```
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,
max_d=max_d, # plot a horizontal cut-off line
)
plt.show()
```

As we can see, we ("surprisingly") have two clusters at this cut-off.

In general for a chosen cut-off value max_d you can always simply count the number of intersections with vertical lines of the dendrogram to get the number of formed clusters. Say we choose a cut-off of max_d = 16, we'd get 4 final clusters:

```
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,
max_d=16,
)
plt.show()
```

### Automated Cut-Off Selection (or why you shouldn't rely on this)¶

Now while this manual selection of a cut-off value offers a lot of benefits when it comes to checking for a meaningful clustering and cut-off, there are cases in which you want to automate this.

The problem again is that there is no golden method to pick the number of clusters for all cases (which is why i think the investigative & backtesting manual method is preferable). Wikipedia lists a couple of common methods. Reading this, you should realize how different the approaches and how vague their descriptions are.

I honestly think it's a really bad idea to just use any of those methods, unless you know the data you're working on really really well.

#### Inconsistency Method¶

For example, let's have a look at the "inconsistency" method, which seems to be one of the defaults for the fcluster() function in scipy.

The question driving the inconsistency method is "what makes a distance jump a jump?". It answers this by comparing each cluster merge's height h to the average avg and normalizing it by the standard deviation std formed over the depth previous levels:

$$inconsistency = \frac{h - avg}{std}$$The following shows a matrix of the avg, std, count, inconsistency for each of the last 10 merges of our hierarchical clustering with depth = 5

```
from scipy.cluster.hierarchy import inconsistent
depth = 5
incons = inconsistent(Z, depth)
incons[-10:]
```

Now you might be tempted to say "yay, let's just pick 5" as a limit in the inconsistencies, but look at what happens if we set depth to 3 instead:

```
depth = 3
incons = inconsistent(Z, depth)
incons[-10:]
```

Oups! This should make you realize that the inconsistency values heavily depend on the depth of the tree you calculate the averages over.

Another problem in its calculation is that the previous d levels' heights aren't normally distributed, but expected to increase, so you can't really just treat the current level as an "outlier" of a normal distribution, as it's expected to be bigger.

#### Elbow Method¶

Another thing you might see out there is a variant of the "elbow method". It tries to find the clustering step where the acceleration of distance growth is the biggest (the "strongest elbow" of the blue line graph below, which is the highest value of the green graph below):

```
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
acceleration = np.diff(last, 2) # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print "clusters:", k
```

While this works nicely in our simplistic example (the green line takes its maximum for k=2), it's pretty flawed as well.

One issue of this method has to do with the way an "elbow" is defined: you need at least a right and a left point, which implies that this method will never be able to tell you that all your data is in one single cluster only.

Another problem with this variant lies in the np.diff(Z[:, 2], 2) though. The order of the distances in Z[:, 2] isn't properly reflecting the order of merges within one branch of the tree. In other words: there is no guarantee that the distance of Z[i] is contained in the branch of Z[i+1]. By simply computing the np.diff(Z[:, 2], 2) we assume that this doesn't matter and just compare distance jumps from different branches of our merge tree.

If you still don't want to believe this, let's just construct another simplistic example but this time with very different variances in the different clusters:

```
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
X2 = np.concatenate((X, c, d, e),)
plt.scatter(X2[:,0], X2[:,1])
plt.show()
```

As you can see we have 5 clusters now, but they have increasing variances... let's have a look at the dendrogram again and how you can use it to spot the problem:

```
Z2 = linkage(X2, 'ward')
plt.figure(figsize=(10,10))
fancy_dendrogram(
Z2,
truncate_mode='lastp',
p=30,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=40,
max_d=170,
)
plt.show()
```

When looking at a dendrogram like this and trying to put a cut-off line somewhere, you should notice the very different distributions of merge distances below that cut-off line. Compare the distribution in the cyan cluster to the red, green or even two blue clusters that have even been truncated away. In the cyan cluster below the cut-off we don't really have any discontinuity of merge distances up to very close to the cut-off line. The two blue clusters on the other hand are each merged below a distance of 25, and have a gap of > 155 to our cut-off line.

The variant of the "elbow" method will incorrectly see the jump from 167 to 180 as minimal and tell us we have 4 clusters:

```
last = Z2[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
acceleration = np.diff(last, 2) # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print "clusters:", k
```

The same happens with the inconsistency metric:

```
print inconsistent(Z2, 5)[-10:]
```

I hope you can now understand why i'm warning against blindly using any of those methods on a dataset you know nothing about. They can give you some indication, but you should always go back in and check if the results make sense, for example with a dendrogram which is a great tool for that (especially if you have higher dimensional data that you can't simply visualize anymore).

## Retrieve the Clusters¶

Now, let's finally have a look at how to retrieve the clusters, for different ways of determining k. We can use the fcluster function.

### Knowing max_d:¶

Let's say we determined the max distance with help of a dendrogram, then we can do the following to get the cluster id for each of our samples:

```
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
clusters
```

### Knowing k:¶

Another way starting from the dendrogram is to say "i can see i have k=2" clusters. You can then use:

```
k=2
fcluster(Z, k, criterion='maxclust')
```

### Using the Inconsistency Method (default):¶

If you're really sure you want to use the inconsistency method to determine the number of clusters in your dataset, you can use the default criterion of fcluster() and hope you picked the correct values:

```
from scipy.cluster.hierarchy import fcluster
fcluster(Z, 8, depth=10)
```

# Visualizing Your Clusters¶

If you're lucky enough and your data is very low dimensional, you can actually visualize the resulting clusters very easily:

```
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='prism') # plot points with cluster dependent colors
plt.show()
```

I hope you enjoyed this tutorial. Feedback welcome ;)

## Further Reading:¶

- The scipy hierarchical clustering module: http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
- The scipy distance computation docs: http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
- The scipy hierarchical clustering module docs often refer to the MATLAB docs saying that a certain function is similar to the MATLAB one. Here's their hierarchical clustering tutorial: http://mathworks.com/help/stats/hierarchical-clustering.html

Pedro MarcelinoThis is great stuff. Thanks for your sharing.

ElaineI am doing same as your code, but I don’t get the same for Z[0]. I am getting array([ 5.20000000e+01, 5.30000000e+01, 4.15105485e-02, 2.00000000e+00]). Do you know why?

Thank you.

joernPost authorare you sure you set the random number generator seed with random.seed(4711) immediately before creating the data?

HaifengI really like this tutorial! Thanks a lot for the sharing.

IreneThank you so much. There are no enough articles explaining how works SciPy Hierarchical Clustering for non-technical users.

RobThanks! Great post!

ReemThanks a lot so useful

Lorenzohey thanks for the great post. regarding the last plotting, from my understanding I specified k=3 so that:

clusters = fcluster(Z,3,criterion=’maxclust’)

Hence clusters is an array with a number specifying to what cluster each value in Z belongs to. However when I then do:

plt.scatter(Z[:,0],Z[:,1], c=clusters)

plt.show()

I get the following error:

ValueError: Color array must be two-dimensional

any help?

LorenzoOk found the problem, the issue is the size mismatch between the clusters array and the Z arrays.

joernPost authoryupp, sorry for the late reply… you just confused X and Z… X can be plotted ([cci]plt.scatter(X[:,0],X[:,1], c=clusters)[/cci]) and contains your points, Z the hierarchical cluster information.

John MillerI’m confused about the `linkage` method….

The documentation says that it takes as input a “condensed distance matrix”. My confusion is that it seems that in this tutorial we are giving `linkage` the sample data directly ( Z = linkage(X, ‘ward’) ), but I thought I would have to make a “distance matrix” first (eg with `pdist`).

Can you please clarify how to use `linkage`?

joernPost authorin a pythonic manner the linkage method can work with both as described in its docs: “A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of mm observation vectors in n dimensions may be passed as an m by n array.” http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

John MillerI see, thank you!

So, the following two options would produce identical output:

# 1

Z = linkage(X, method=’ward’)

# 2

Y = pdist(X, method=’ward’)

Z = linkage(Y)

Kevin EgerThis is an awesome post! Really informative and exactly the intro I needed. How would the technique change if you were clustering 1D data? ie: if rather than having 2D data points, you instead had a calculated distance between a point and every other point (in an nxn matrix where n: point)?

Kevin EgerI guess to rephrase my question, I’ll describe what the matrix X would look like:

X.shape == (n, m)

-n: users (samples)

– users 0 – n-1

-m: # of comments (features)

– # of comments: integer value

joernPost authori used 2D examples (so m == 2) as they’re easiest to visualize, but everything above works for arbitrary m… especially for m == 1 as in your case. You’d have X.shape == (n, 1) …

Saurabh GoyalThank you for this amazing tutorial.

I wanted to ask if there is anyway to find out the features which is common for a particular cluster. For example in document classification i would like to know which words are responsible for the formation of a cluster (common keyword for a cluster).

joernPost authorI think there isn’t really “the one answer” for the general case, as finding common features in a cluster only works if your distance function is “decompose-able”. The clusters are formed based on distances between whole samples (so over all features).

Now in the document domain a typical distance is the cosine distance on tf-idf word vectors over documents, so you are lucky as that’s just the dot product (which is just the sum over the multiplication of all individual features (so the word counts) and can be decomposed). I’d probably approach this by forming “cluster documents”. This means that you use the ideas of tf-idf and just apply them to the clusters. Add all the document vectors (so all the word occurrences within a cluster) to get the clusters’ term frequencies (tf). Afterwards use them to form an “inverse cluster frequency” (icf) (equivalent to the “inverse document frequency” (idf)), with which you can build the tf-icf. Then per cluster just select the term with the top tf-icf and you should have your label.

That’s obviously only one solution and there’s a whole lot of research in the field of document analysis about Cluster labeling (https://en.wikipedia.org/wiki/Cluster_labeling).

O.rkaThanks for the tutorial this definitely helped get me started. I’m confused on the input Z of the linkage function. I have a similarity matrix, convert it to a dissimilarity matrix, then do I give linkage the np.triu, np.tril, or the square matrix? I posted my question on stackoverflow for figures and examples: http://stackoverflow.com/questions/36520043/triangle-vs-square-distance-matrix-for-hierarchical-clustering-python

MaryI keep getting “ValueError: negative dimensions are not allowed” when calling linkage(X, ‘ward’),

where X is an ndarray of shape (54000, 100) and represents 54000 observations in a 100 feature space. What gives?

joernPost authorwhat does [cci]print X.shape[/cci] tell you?

MaryIt says (54000L, 100L) Not sure if the L would have something to do with this.

joernPost author54000 by far isn’t long, so seems you’re doing something weird there…

can you manually invoke the pdist method and pass it to linkage like this?

Or maybe just re-create the array and check if it still has a weird “long” shape?

MaryI did:

and got:

ValueError: Valid methods when the raw observations are omitted are ‘single’, ‘complete’, ‘weighted’, and ‘average’.

Then I just replaced ‘ward’ by ‘average’ and got the initial ValueError of negatives

I initialize X as:

I tried the simple int too and np.int16 for the dtype parameter and I still get the L when calling shape. I used your example and got L too.

I have two versions of python installed on my system and I’m running this on python 2.7.11 I don’t know what else to say except that I’m going back to R.

MaryNvm. I tried the same initialization line in python 3 and it works fine. 😐

MithunAwesome tutorial, mate! You made the life a lot easier!

joernPost author😉

ArnoldHi, how can i get all sample indices within the cluster, how can you get it from scipy.cluster.hierarchy.dendrogram? i plan to create a dictionary: key = cluster_index, value = list of sample_index

joernPost authorArnoldHi,

thanks for the info.

i have also a question to the normalization of the data:

my feature matrix has the following format: X : array-like, shape = [n_samples, n_features]

how can i scale to zero mean and variance of 1? Im not sure if http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html is what i look for…

how does the dendrogram coloring work, it seems not each cluster gets its own color in the dendrogram? ploting the dendrogram: ddata = dendrogram(hc, color_threshold=cut_off, leaf_rotation=90.) plt.axhline(y=cut_off, linewidth=4, color=’r’, linestyle=’–‘) plt.show() here is the dendrogram: http://s32.postimg.org/nhjilnedh/Screen_Shot_2016_05_19_at_6_06_27_PM.png

from the dendogram i know that cluster 1 which consists of sample 1,2,3 … how can i figure out which feature mostly represents cluster 1? is there anything else i tell about cluster 1?

joernPost authoryupp, you can use that scaling function to scale…

coloring: for each cluster beneath the threshold it picks a different color

feature representation of the clusters depends on the distance function you use:

https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#comment-3648

ArnoldHi,

for the following image you said there are 5 clusters.

see plot after “As you can see we have 5 clusters now, but they have increasing variances… let’s have a look at the dendrogram again and how you can use it to spot the problem:”

but the image shows only 4 different color. why is that?

how can you extract the most important feature/s forming a cluster? assume i have 30 features.

Thanks,

Arnold

joernPost authorthat has to do with the color threshold parameter in the [cci]fancy_dendrogram[/cci]. If you look closely you will see that the vertical lines crossing the limit line are always blue… Just that in the case of the very small clusters they are completely collapsed in the shown dendrogram so that you can only see them as blue lines.

wrt. the features have a look at the answers above.

NghiaThank you for the amazing tutorial.

I have a question regards to the max_d values in fcluster function. How can we define acceptable max_d’s values?

joernPost authorone of the main points in the post is to show that there isn’t a single definition of max_d that makes sense in all cases… that’s what you need the dendrogram for: to look at it and understand what different cut-offs will do to your data.

In general you should look for “jumps” in the distances, but be suspicious about imbalances of the distributions (a mixture of very tight clusters and very wide ones). A single selection of max_d might not be enough in case your cluster distributions are skewed. You will see this in the dendrogram when you can’t simply put a horizontal line to cut through all the huge jumps of all merges.

JoLet me first thank you for this great tutorial !

However something still bothers me, I’ll try to make it simple.

I have 5 arrays: 0, 1, 2, 3, 4

The corresponding linkage output is as follow:

[ [ 0. 3. 0.03376334 2. ]

[ 2. 1. 0.03924361 2. ]

[ 4. 7. 0.04020413 3. ]]

How am I supposed to know which of first cluster (“0” & “3”) or second cluster (“2” & “1”) is the one named “7”, to which the array “4” links ?

Thanks in advance for your consideration,

Jo

joernPost authorfrom the post: All indices idx >= len(X) actually refer to the cluster formed in Z[idx – len(X)]

your example seems constructed (and wrong), as a clustering of 5 points requires 4 pairwise merges… wrong because the last merge seems to be self-referring.

JoI did omit he last line of the linkage output… my bad.

Thank you for your answer, it was indeed already explain in your post… I’ll read better next time !

Thanks again for this clear tuto !

BernyHi, thanks for this tutorial.

I want to ask you if it is possible to use a function define by me to calculate the distance or i must use the distance defined by the scipy.

Thanks for your answer

joernPost authoryou can use self defined distance functions

JoshHow do you use `link_color_func` in `dendrogram` ? I have a dictionary where `{key=node/leaf : value=hex _color}` ?

O.rkaCan you explain what’s happening in:

“`

for i, d, c in zip(ddata[‘icoord’], ddata[‘dcoord’], ddata[‘color_list’]):

x = 0.5 * sum(i[1:3])

y = d[1]

“`

joernPost authorhehe, yeah, that’s a “hack” to draw the annotations for the heights.

x is the middle between the two connected clusters (so the middle between the vertical lines).

y is the height of the bridge.

If we’re above a certain height, we’ll plot an additional point in the cluster’s color and write the height value below it.

WillThanks for the post, very helpful.

In case it’s useful to others reading here: fcluster inexplicably returns an array whose minimum entry is 1, not 0. Beware when using it to index numpy arrays.

Alejandrogreat, great tutorial!

greetings from BA, Argentina

joernPost authorthanks, glad it helped

ronyThank you.. awesome post..!!

joernPost authorthanks, and i’m happy to see that i’m not the only one who uses spam-tracking emails 😉

Mark GVery well written and insightful, thank you Joern

KyleThank you so much for this, it was a really excellent tutorial!

Nutchaya P.Thank you so much, your tutorial is very helpful.

I have a question. For the dendrogram on the x axis, how can I label it with the values of the samples instead of indices ?

MegThank you very much. This is truly great! I have a question though. Do you know of a way to introduce new samples to the process after the clustering is finished? Just like when we expand an SOM. What I want to do is to create my clusters with a training set and test it with a testing set. Is that possible?

kerr hgreat post. one of the reasons people use python so much is the quality of the online material and examples available, and your tutorial on clustering is a perfect example.

joernPost author🙂 thanks

LorienI’m using Python, numpy and scipy to do some hierarchical clustering on the output of a topic model I created for text analysis.

I applied my testcorpus to the ldamodel so it became a bag-of-words representation. Then I turned it into a matrix. Now I want to use scipy to make a linkage matrix of my matrix. But it gives the Value Error: setting an array element with a sequence. I guess that this is because only equally shaped arrays can be clustered. And my matrix has a difference in lengths between the lists inside the list of lists. I just don’t now how to solve this. Here is a little part of the code. I don’t know if it is helpful. I just really hope someone can help me.

import numpy as np

X = np.array(corpus)

from matplotlib import pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, ‘cosine’)

joernPost authorhmm, not knowing how your

`corpus`

looks like, i can only guess… my best guess is that you’re dealing with a list of some kind of LDA dicts:`[{dim1: score1, dim2: score2,...},...]`

. As you indicate scipy doesn’t really understand this when you do`np.array(corpus)`

, so you need to transform that list of feature-mappings into a list of feature-vectors, either manually or with sklearn.feature_extraction.DictVectorizer.fit_transform for example.AlexGhitiThank you very much for this tutorial, it really helps 🙂

ErnestoOnly one word to describe your work: Excellent

joernPost authorthanks 😉

babathis is an excellent tutorial, thanks!! Is there an easy way to make this an online algorithm? i.e. if we have new data points, you can merge them into existing clusters?

joernPost authorthanks. not really (or at least i don’t see any). The reason is that a single added data point could bridge points (low level clusters) that are otherwise merged very late in the process. The whole tree would look different in that case.

JoshanTanks for your post.. Helped me a lot.. have any idea about datafiled potential function in clustering techniques.

joernPost authorthanks. i don’t understand the question though, sorry…

shaihello, the code for retrieving k clusters does not word, gives an error:

raise ValueError(“Invalid RGBA argument: {!r}”.format(orig_c))

ValueError: Invalid RGBA argument: 1

any ideas why?

bpHi,

I am new to data analysis and Python in itself. I was looking at hierarchical clustering and chanced on your tutorial. While your tutorial is pretty easy to follow (thank you!), I am confused if I can use it in my use case. I have a complete weighted undirected graph and I need to find clusters in that graph. I am confused how I can draw a dendrogram to visualize my graph such that the y axis ranges from the min of all edge weights to the max of all edge weights. Any help is highly appreciated. Thank you.

joernPost authorIf you have a fully connected graph and just want to use the edge weights as distance directly, you can actually pass a square distance matrix into the linkage function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

In your case that would just be the weighted adjacency matrix.

bpNevermind, figured it out. Your tutorial is very helpful – Thank you!

EthanHi, thank you very much for this tutorial it is very useful.

I did not understand how the inconsistency values are used to identify optimal # of clusters. Can you please give more explanation about it.

Thanks

joernPost authorHi, uhm the point of my explanations around the inconsistency method are that it’s not very reliable. In theory one is interested in outliers of the inconsistency score, but as shown it’s somewhat arbitrary. So i’d suggest to look at the inconsistency scores and only take its outliers as indicators for a reasonable “K”. I’d always suggest to afterwards go back to the dendrogram and check if it makes sense.

LucyHi Joern and thank you for this great tutorial. I helped me a lot!

Someone has posted about getting a sample indices within the cluster and stored into a dictionary. I did the solution that you recommended in which each cluster list contains the size and the value, i.e. sample index.

My question is, instead of returning the sample index, what can I do to return the data id itself? In prior to the clustering, I have tagged my data so that each data has a unique id.

The reason for this is because I want to do an analysis that involves identification and measuring the amount of data from the same class that were clustered in the same cluster.

Thanks in advance!

Lucy.

joernPost authorhi, sorry for the late reply…

I’m not sure i fully understand the question. If all you want to do is get from the idx back to your data, you should be able to do

`X[idx]`

. If there’s further information that is not in`X`

connected to that data row, you can always have an external other dictionary to do the lookup!?!Todd WightThis was awesome. Thanks!

Pingback: Hierarchical Clustering | Just My Notes

FedericoI’m here to investigate how the third column of the dendrogram (distances) is computed. I’m currently using method=’single’, metric=’euclidean’ and I don’t understand why in the distance column of the dendrogram all values are different from the ones provided in the 2d array of observation vectors. For example, in the first iteration when we merge two clusters of size 1 each, I would expect the distance to be somewhere in the 2d array of observation vectors.

You stated the following:

“No matter what method and metric you pick, the linkage() function will use that method and metric (i.e. cophenet) to calculate the distances of the clusters (starting with your n individual samples (aka data points) as singleton clusters)) and in each iteration will merge the two clusters which have the smallest distance according the selected method and metric.”

However, in the documentation for scipy.cluster.hierarchy.linkage, cophenet is never mentioned. In other words, I don’t understand why cophenet should bypass method and metric. Could you please support the previous statement?

joernPost authoryes, i’d also expect the distance to appear in the linkage results. Without a code example i can only guess that you did not calculate the condensed

`pdist`

, but instead passed`X`

into`linkage`

. In that case scipy will implicitly calculate the distances. In other words: you won’t find rows of`X`

in`Z`

, but you should find`pdist(X)`

rows in`Z`

, especially for`method='single'`

.Wrt. the second part, i don’t know where you got that quote from. I never made that statement about cophenet. Cophenet is neither metric nor method. Obviously, if you’re calculating cophenet on

`Z, pdist(X)`

, you should pass the corresponding metric into`pdist`

:`pdist(X, metric='...')`

.Praful KThis is an awesome post ! Impressive clarity of thought and buildup. Thank you.

Harry BakerHey, this was an awesome explanation.

I need a little bit of help going further with my clustering. I want to do some labeling of the cluster’s I’ve found, so I’m trying to figure out how to grab every object that appears in each of the final N truncated clusters. I’m clustering documents based on topic model distributions, so I want to know which topics describe each of the clusters that I found.

Thanks!

AbhishekThanks Joern I understood Y axis of the dendrogram from your blog. Which was crucial to differentiate clustring using WARD and AVERAGE.

WARD has 80 as the range of distance ( Y axis)

AVERAGE has only 14 as the distance range (Y axis)

By the above details, I concluded that WARD is the better distance measure than AVERAGE.

Please let me know if my hypothesis is appropriate.

Shamiek MangipudiFantastic tutorial with tremendous insight. Really loved the flow and the ease with which the author presented the content. This should deserve a gold on ELIF.

joernPost author😉 thanks

LironThis is unbelievable. Clear, well-explained, interactive. Thanks!

miinnaHi is it possible to find out all the clustering nodes in a pair without getting merged( what are getting merged after one iteration). For example, I have 2 cluster {1,2}, and {3,2} . the linkage methods are showing [{1,2} ],[{3,4} ] here 4 is the merged number of {1,2} cluster.am trying to show like [{1,2} ],[{3,(1,2)} ]

can anybody help…!

Pingback: Baby Steps – Ground Zero

Ram BGreat Tutorial! I really appreciate you taking the time to write this piece!!

shanhegreat works! thanks for a lot

FouadYour tutorial is really clear and helpful, thanks a lot !

hewgreenClass tutorial pal. Much appreciated.

ThiagoThanks so much. It is really very helpful, specially because you discuss on comments, not only the post. You give support to us.

So, if I already have the distance matrix between my data, I have to pass the matrix in triangular shape, otherwise it will be interpreted as the dataset?

In my case, I have pseudo-distance, since diagonal is not zero, but constant equal to 0.5. Should I keep it in the triangular matrix or only the upper part?

Regards.

NaveinThanks for your tutorial on this. I was wondering, is it possible to use hierarchical clustering to cluster results from minhash LSH? I looked around and could not find any tutorials or codes on it, but I have read about it in some papers online.

joernPost authorhmm, tricky question 😉

Depending on your point of view, locality-sensitive hashing can be seen as an alternative to clustering (LSH is already meant to put things that are similar close together in a lower dim vector space), a dimensionality reduction technique or a way to quickly retrieve candidates that are hopefully close together…

So when you ask how to do hierarchical clustering on results from LSH, you could either just apply hierarchical clustering on the lower dim LSH vector space representation of each item, or you could use LSH to quickly retrieve candidates. In the former case, your vector space will very likely have weird properties and will require attention to the selection of your clustering variant and the used distance function. In the latter case, i’d suggest to try and get the original vector space representation (pre-LSH) and then cluster the candidates in that vector space, which is probably much easier to “interpret”.

PacoHi Joern and thanks for a nice tutorial.

Now, there comes the question. Is this applicable to time series? My data matrix consists of different daily temperature time series for different locations on each column. I would like to build clusters so I can regionally group locations based on temperature time series. Do you think your example could be applied for my problem?

Thanks in advance and best regards

joernPost authorthanks 🙂

hmmm, not sure if i understand the question. I’d say you that as soon as you have some vector-space representation of your data, you can use clustering techniques. The main difficulty however is finding a meaningful distance measure for your specific use-case.

In your case, the combination of geo-locs, time series and temperatures will most likely not be euclidean 😉 So you’ll want to define your own distance measure by asking yourself something like: “What’s the distance of two datapoints that are 1km apart and have the same temperature but are recorded 4 months apart?” Vary all combinations of the dimensions and think about how you’d like your “distance” to change. Also be aware that many clustering variants will require certain properties of your defined “distance” like (symmetry, reflexivity, transitivity, triangle inequality…).

Michael RoseThe link in elbow plot named “out there” leads to a random placeholder website.

joernPost authorthanks, updated 🙂

AlfredoThank you very much for this great, useful tutorial !

Alexandre LarrainThank you very much Jörn, this is very clear and awesome

Pingback: Online reviews: keyword clusters - Data Science at Reputation.com