Tag Archives: python

SciPy Hierarchical Clustering and Dendrogram Tutorial

This is a tutorial on how to use scipy's hierarchical clustering.

One of the benefits of hierarchical clustering is that you don't need to already know the number of clusters k in your data in advance. Sadly, there doesn't seem to be much documentation on how to actually use scipy's hierarchical clustering to make an informed decision and then retrieve the clusters.

In the following I'll explain:

Naming conventions:

Before we start, as i know that it's easy to get lost, some naming conventions:

  • X samples (n x m array), aka data points or "singleton clusters"
  • n number of samples
  • m number of features
  • Z cluster linkage array (contains the hierarchical clustering information)
  • k number of clusters

So, let's go.

Imports and Setup

In [1]:
# needed imports
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
In [2]:
# some setting for this notebook to actually show the graphs inline, you probably won't need this
%matplotlib inline
np.set_printoptions(precision=5, suppress=True)  # suppress scientific float notation

Generating Sample Data

You'll obviously not need this step to run the clustering if you have own data.

The only thing you need to make sure is that you convert your data into a matrix X with n samples and m features, so that X.shape == (n, m).

In [3]:
# generate two clusters: a with 100 points, b with 50:
np.random.seed(4711)  # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)
print X.shape  # 150 samples with 2 dimensions
plt.scatter(X[:,0], X[:,1])
plt.show()
(150, 2)

Perform the Hierarchical Clustering

Now that we have some very simple sample data, let's do the actual clustering on it:

In [4]:
# generate the linkage matrix
Z = linkage(X, 'ward')

Done. That was pretty simple, wasn't it?

Well, sure it was, this is python ;), but what does the weird 'ward' mean there and how does this actually work?

As the scipy linkage docs tell us, 'ward' is one of the methods that can be used to calculate the distance between newly formed clusters. 'ward' causes linkage() to use the Ward variance minimization algorithm.

I think it's a good default choice, but it never hurts to play around with some other common linkage methods like 'single', 'complete', 'average', ... and the different distance metrics like 'euclidean' (default), 'cityblock' aka Manhattan, 'hamming', 'cosine'... if you have the feeling that your data should not just be clustered to minimize the overall intra cluster variance in euclidean space. For example, you should have such a weird feeling with long (binary) feature vectors (e.g., word-vectors in text clustering).

As you can see there's a lot of choice here and while python and scipy make it very easy to do the clustering, it's you who has to understand and make these choices. If i find the time, i might give some more practical advice about this, but for now i'd urge you to at least read up on the linked methods and metrics to make a somewhat informed choice. Another thing you can and should definitely do is check the Cophenetic Correlation Coefficient of your clustering with help of the cophenet() function. This (very very briefly) compares (correlates) the actual pairwise distances of all your samples to those implied by the hierarchical clustering. The closer the value is to 1, the better the clustering preserves the original distances, which in our case is pretty close:

In [5]:
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(Z, pdist(X))
c
Out[5]:
0.98001483875742679

No matter what method and metric you pick, the linkage() function will use that method and metric to calculate the distances of the clusters (starting with your n individual samples (aka data points) as singleton clusters)) and in each iteration will merge the two clusters which have the smallest distance according the selected method and metric. It will return an array of length n - 1 giving you information about the n - 1 cluster merges which it needs to pairwise merge n clusters. Z[i] will tell us which clusters were merged in the i-th iteration, let's take a look at the first two points that were merged:

In [6]:
Z[0]
Out[6]:
array([ 52.     ,  53.     ,   0.04151,   2.     ])

We can see that ach row of the resulting array has the format [idx1, idx2, dist, sample_count].

In its first iteration the linkage algorithm decided to merge the two clusters (original samples here) with indices 52 and 53, as they only had a distance of 0.04151. This created a cluster with a total of 2 samples.

In [7]:
Z[1]
Out[7]:
array([ 14.     ,  79.     ,   0.05914,   2.     ])

In the second iteration the algorithm decided to merge the clusters (original samples here as well) with indices 14 and 79, which had a distance of 0.04914. This again formed another cluster with a total of 2 samples.

The indices of the clusters until now correspond to our samples. Remember that we had a total of 150 samples, so indices 0 to 149. Let's have a look at the first 20 iterations:

In [8]:
Z[:20]
Out[8]:
array([[  52.     ,   53.     ,    0.04151,    2.     ],
       [  14.     ,   79.     ,    0.05914,    2.     ],
       [  33.     ,   68.     ,    0.07107,    2.     ],
       [  17.     ,   73.     ,    0.07137,    2.     ],
       [   1.     ,    8.     ,    0.07543,    2.     ],
       [  85.     ,   95.     ,    0.10928,    2.     ],
       [ 108.     ,  131.     ,    0.11007,    2.     ],
       [   9.     ,   66.     ,    0.11302,    2.     ],
       [  15.     ,   69.     ,    0.11429,    2.     ],
       [  63.     ,   98.     ,    0.1212 ,    2.     ],
       [ 107.     ,  115.     ,    0.12167,    2.     ],
       [  65.     ,   74.     ,    0.1249 ,    2.     ],
       [  58.     ,   61.     ,    0.14028,    2.     ],
       [  62.     ,  152.     ,    0.1726 ,    3.     ],
       [  41.     ,  158.     ,    0.1779 ,    3.     ],
       [  10.     ,   83.     ,    0.18635,    2.     ],
       [ 114.     ,  139.     ,    0.20419,    2.     ],
       [  39.     ,   88.     ,    0.20628,    2.     ],
       [  70.     ,   96.     ,    0.21931,    2.     ],
       [  46.     ,   50.     ,    0.22049,    2.     ]])

We can observe that until iteration 13 the algorithm only directly merged original samples. We can also observe the monotonic increase of the distance.

In iteration 13 the algorithm decided to merge cluster indices 62 with 152. If you paid attention the 152 should astonish you as we only have original sample indices 0 to 149 for our 150 samples. All indices idx >= len(X) actually refer to the cluster formed in Z[idx - len(X)].

This means that while idx 149 corresponds to X[149] that idx 150 corresponds to the cluster formed in Z[0], idx 151 to Z[1], 152 to Z[2], ...

Hence, the merge iteration 13 merged sample 62 to our samples 33 and 68 that were previously merged in iteration 2 (152 - 2).

Let's check out the points coordinates to see if this makes sense:

In [9]:
X[[33, 68, 62]]
Out[9]:
array([[ 9.83913, -0.4873 ],
       [ 9.89349, -0.44152],
       [ 9.97793, -0.56383]])

Seems pretty close, but let's plot the points again and highlight them:

In [10]:
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])  # plot all points
plt.scatter(X[idxs,0], X[idxs,1], c='r')  # plot interesting points in red again
plt.show()

We can see that the 3 red dots are pretty close to each other, which is a good thing.

The same happened in iteration 14 where the alrogithm merged indices 41 to 15 and 69:

In [11]:
idxs = [33, 68, 62]
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1])
plt.scatter(X[idxs,0], X[idxs,1], c='r')
idxs = [15, 69, 41]
plt.scatter(X[idxs,0], X[idxs,1], c='y')
plt.show()

Showing that the 3 yellow dots are also quite close.

And so on...

We'll later come back to visualizing this, but now let's have a look at what's called a dendrogram of this hierarchical clustering first:

Plotting a Dendrogram

A dendrogram is a visualization in form of a tree showing the order and distances of merges during the hierarchical clustering.

In [12]:
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

(right click and "View Image" to see full resolution)

If this is the first time you see a dendrogram, it's probably quite confusing, so let's take this apart...

  • On the x axis you see labels. If you don't specify anything else they are the indices of your samples in X.
  • On the y axis you see the distances (of the 'ward' method in our case).

Starting from each label at the bottom, you can see a vertical line up to a horizontal line. The height of that horizontal line tells you about the distance at which this label was merged into another label or cluster. You can find that other cluster by following the other vertical line down again. If you don't encounter another horizontal line, it was just merged with the other label you reach, otherwise it was merged into another cluster that was formed earlier.

Summarizing:

  • horizontal lines are cluster merges
  • vertical lines tell you which clusters/labels were part of merge forming that new cluster
  • heights of the horizontal lines tell you about the distance that needed to be "bridged" to form the new cluster

You can also see that from distances > 25 up there's a huge jump of the distance to the final merge at a distance of approx. 180. Let's have a look at the distances of the last 4 merges:

In [13]:
Z[-4:,2]
Out[13]:
array([  15.11533,   17.11527,   23.12199,  180.27043])

Such distance jumps / gaps in the dendrogram are pretty interesting for us. They indicate that something is merged here, that maybe just shouldn't be merged. In other words: maybe the things that were merged here really don't belong to the same cluster, telling us that maybe there's just 2 clusters here.

Looking at indices in the above dendrogram also shows us that the green cluster only has indices >= 100, while the red one only has such < 100. This is a good thing as it shows that the algorithm re-discovered the two classes in our toy example.

In case you're wondering about where the colors come from, you might want to have a look at the color_threshold argument of dendrogram(), which as not specified automagically picked a distance cut-off value of 70 % of the final merge and then colored the first clusters below that in individual colors.

Dendrogram Truncation

As you might have noticed, the above is pretty big for 150 samples already and you probably have way more in real scenarios, so let me spend a few seconds on highlighting some other features of the dendrogram() function:

In [14]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=12,  # show only the last p merged clusters
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()

The above shows a truncated dendrogram, which only shows the last p=12 out of our 149 merges.

First thing you should notice are that most labels are missing. This is because except for X[40] all other samples were already merged into clusters before the last 12 merges.

The parameter show_contracted allows us to draw black dots at the heights of those previous cluster merges, so we can still spot gaps even if we don't want to clutter the whole visualization. In our example we can see that the dots are all at pretty small distances when compared to the huge last merge at a distance of 180, telling us that we probably didn't miss much there.

As it's kind of hard to keep track of the cluster sizes just by the dots, dendrogram() will by default also print the cluster sizes in brackets () if a cluster was truncated:

In [15]:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=12,  # show only the last p merged clusters
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()

We can now see that the right most cluster already consisted of 33 samples before the last 12 merges.

Eye Candy

Even though this already makes for quite a nice visualization, we can pimp it even more by also annotating the distances inside the dendrogram by using some of the useful return values dendrogram():

In [16]:
def fancy_dendrogram(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)

    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata
In [17]:
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,  # useful in small plots so annotations don't overlap
)
plt.show()

Selecting a Distance Cut-Off aka Determining the Number of Clusters

As explained above already, a huge jump in distance is typically what we're interested in if we want to argue for a certain number of clusters. If you have the chance to do this manually, i'd always opt for that, as it allows you to gain some insights into your data and to perform some sanity checks on the edge cases. In our case i'd probably just say that our cut-off is 50, as the jump is pretty obvious:

In [18]:
# set cut-off to 50
max_d = 50  # max_d as in max_distance

Let's visualize this in the dendrogram as a cut-off line:

In [19]:
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,
    max_d=max_d,  # plot a horizontal cut-off line
)
plt.show()

As we can see, we ("surprisingly") have two clusters at this cut-off.

In general for a chosen cut-off value max_d you can always simply count the number of intersections with vertical lines of the dendrogram to get the number of formed clusters. Say we choose a cut-off of max_d = 16, we'd get 4 final clusters:

In [20]:
fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=10,
    max_d=16,
)
plt.show()

Automated Cut-Off Selection (or why you shouldn't rely on this)

Now while this manual selection of a cut-off value offers a lot of benefits when it comes to checking for a meaningful clustering and cut-off, there are cases in which you want to automate this.

The problem again is that there is no golden method to pick the number of clusters for all cases (which is why i think the investigative & backtesting manual method is preferable). Wikipedia lists a couple of common methods. Reading this, you should realize how different the approaches and how vague their descriptions are.

I honestly think it's a really bad idea to just use any of those methods, unless you know the data you're working on really really well.

Inconsistency Method

For example, let's have a look at the "inconsistency" method, which seems to be one of the defaults for the fcluster() function in scipy.

The question driving the inconsistency method is "what makes a distance jump a jump?". It answers this by comparing each cluster merge's height h to the average avg and normalizing it by the standard deviation std formed over the depth previous levels:

$$inconsistency = \frac{h - avg}{std}$$

The following shows a matrix of the avg, std, count, inconsistency for each of the last 10 merges of our hierarchical clustering with depth = 5

In [21]:
from scipy.cluster.hierarchy import inconsistent

depth = 5
incons = inconsistent(Z, depth)
incons[-10:]
Out[21]:
array([[  1.80875,   2.17062,  10.     ,   2.44277],
       [  2.31732,   2.19649,  16.     ,   2.52742],
       [  2.24512,   2.44225,   9.     ,   2.37659],
       [  2.30462,   2.44191,  21.     ,   2.63875],
       [  2.20673,   2.68378,  17.     ,   2.84582],
       [  1.95309,   2.581  ,  29.     ,   4.05821],
       [  3.46173,   3.53736,  28.     ,   3.29444],
       [  3.15857,   3.54836,  28.     ,   3.93328],
       [  4.9021 ,   5.10302,  28.     ,   3.57042],
       [ 12.122  ,  32.15468,  30.     ,   5.22936]])

Now you might be tempted to say "yay, let's just pick 5" as a limit in the inconsistencies, but look at what happens if we set depth to 3 instead:

In [22]:
depth = 3
incons = inconsistent(Z, depth)
incons[-10:]
Out[22]:
array([[  3.63778,   2.55561,   4.     ,   1.35908],
       [  3.89767,   2.57216,   7.     ,   1.54388],
       [  3.05886,   2.66707,   6.     ,   1.87115],
       [  4.92746,   2.7326 ,   7.     ,   1.39822],
       [  4.76943,   3.16277,   6.     ,   1.60456],
       [  5.27288,   3.56605,   7.     ,   2.00627],
       [  8.22057,   4.07583,   7.     ,   1.69162],
       [  7.83287,   4.46681,   7.     ,   2.07808],
       [ 11.38091,   6.2943 ,   7.     ,   1.86535],
       [ 37.25845,  63.31539,   7.     ,   2.25872]])

Oups! This should make you realize that the inconsistency values heavily depend on the depth of the tree you calculate the averages over.

Another problem in its calculation is that the previous d levels' heights aren't normally distributed, but expected to increase, so you can't really just treat the current level as an "outlier" of a normal distribution, as it's expected to be bigger.

Elbow Method

Another thing you might see out there is a variant of the "elbow method". It tries to find the clustering step where the acceleration of distance growth is the biggest (the "strongest elbow" of the blue line graph below, which is the highest value of the green graph below):

In [23]:
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print "clusters:", k
clusters: 2

While this works nicely in our simplistic example (the green line takes its maximum for k=2), it's pretty flawed as well.

One issue of this method has to do with the way an "elbow" is defined: you need at least a right and a left point, which implies that this method will never be able to tell you that all your data is in one single cluster only.

Another problem with this variant lies in the np.diff(Z[:, 2], 2) though. The order of the distances in Z[:, 2] isn't properly reflecting the order of merges within one branch of the tree. In other words: there is no guarantee that the distance of Z[i] is contained in the branch of Z[i+1]. By simply computing the np.diff(Z[:, 2], 2) we assume that this doesn't matter and just compare distance jumps from different branches of our merge tree.

If you still don't want to believe this, let's just construct another simplistic example but this time with very different variances in the different clusters:

In [24]:
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
X2 = np.concatenate((X, c, d, e),)
plt.scatter(X2[:,0], X2[:,1])
plt.show()

As you can see we have 5 clusters now, but they have increasing variances... let's have a look at the dendrogram again and how you can use it to spot the problem:

In [25]:
Z2 = linkage(X2, 'ward')
plt.figure(figsize=(10,10))
fancy_dendrogram(
    Z2,
    truncate_mode='lastp',
    p=30,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,
    annotate_above=40,
    max_d=170,
)
plt.show()

When looking at a dendrogram like this and trying to put a cut-off line somewhere, you should notice the very different distributions of merge distances below that cut-off line. Compare the distribution in the cyan cluster to the red, green or even two blue clusters that have even been truncated away. In the cyan cluster below the cut-off we don't really have any discontinuity of merge distances up to very close to the cut-off line. The two blue clusters on the other hand are each merged below a distance of 25, and have a gap of > 155 to our cut-off line.

The variant of the "elbow" method will incorrectly see the jump from 167 to 180 as minimal and tell us we have 4 clusters:

In [26]:
last = Z2[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)

acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2  # if idx 0 is the max of this we want 2 clusters
print "clusters:", k
clusters: 4

The same happens with the inconsistency metric:

In [27]:
print inconsistent(Z2, 5)[-10:]
[[  13.99222   15.56656   30.         3.86585]
 [  16.73941   18.5639    30.         3.45983]
 [  19.05945   20.53211   31.         3.49953]
 [  19.25574   20.82658   29.         3.51907]
 [  21.36116   26.7766    30.         4.50256]
 [  36.58101   37.08602   31.         3.50761]
 [  12.122     32.15468   30.         5.22936]
 [  42.6137   111.38577   31.         5.13038]
 [  81.75199  208.31582   31.         5.30448]
 [ 147.25602  307.95701   31.         3.6215 ]]

I hope you can now understand why i'm warning against blindly using any of those methods on a dataset you know nothing about. They can give you some indication, but you should always go back in and check if the results make sense, for example with a dendrogram which is a great tool for that (especially if you have higher dimensional data that you can't simply visualize anymore).

Retrieve the Clusters

Now, let's finally have a look at how to retrieve the clusters, for different ways of determining k. We can use the fcluster function.

Knowing max_d:

Let's say we determined the max distance with help of a dendrogram, then we can do the following to get the cluster id for each of our samples:

In [28]:
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[28]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Knowing k:

Another way starting from the dendrogram is to say "i can see i have k=2" clusters. You can then use:

In [29]:
k=2
fcluster(Z, k, criterion='maxclust')
Out[29]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Using the Inconsistency Method (default):

If you're really sure you want to use the inconsistency method to determine the number of clusters in your dataset, you can use the default criterion of fcluster() and hope you picked the correct values:

In [30]:
from scipy.cluster.hierarchy import fcluster
fcluster(Z, 8, depth=10)
Out[30]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Visualizing Your Clusters

If you're lucky enough and your data is very low dimensional, you can actually visualize the resulting clusters very easily:

In [31]:
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='prism')  # plot points with cluster dependent colors
plt.show()

I hope you enjoyed this tutorial. Feedback welcome ;)

Further Reading:

Scientific Python on Mac OS X 10.9+ with homebrew

There are (too) many guides out there about how to install Python on Mac OS X. So why another one? Cause i found most of the other ways to lead to long-term maintenance debt that is unnecessary and can easily be avoided. Also many of the other guides aren’t very easy to extend in “but i also need that library”-cases. So here I’ll briefly explain how to set-up a scientific python stack based on homebrew. The advantages are that this stack will never conflict with your system’s core and homebrew opens up a whole new world of easy to access unix tools via a simple brew install ....

This step-by-step installation guide to setup a scientific python environment has been tested on Mac OS X Mavericks 10.9 / Yosemite 10.10 / El Capitain 10.11 / Sierra 10.12. It will probably also work in the following versions (if not, let me know in the comments).
An older version of this setup guide can be found here: Scientific Python on Mac OS X 10.8 with homebrew, main changes: rename of a tap + changes wrt. openblas as Accelerate was fixed in OS X 10.9

Needless to say: Make a backup (Timemachine)

First install homebrew.
Follow their instructions, then come back here.

If you don’t have a clean install, some of the following steps might need minor additional attention (like changing permissions chmod, chown, chgrp or overwriting existing files in the linking step with brew link --overwrite package_that_failed. In this case i can only recommend a backup again).

In general: execute the following commands one line at a time and read the outputs! If you read some warnings about “keg-only” that’s fine, it just means that brew won’t “hide” your system’s stuff behind the stuff it installed itself so it doesn’t cause problems… brewed stuff will still use it.

# set up some taps and update brew
brew tap homebrew/science # a lot of cool formulae for scientific tools
brew tap homebrew/python # numpy, scipy, matplotlib, ...
brew update && brew upgrade

# install a brewed python
brew install python

# and/or if you want to give python3 a go (it's about time):
#brew install python3  # repeat/replace the following pip commands with pip3

A word about brewed python: this is what you want!

It’s more up to date than the system python, it will come with pip and correctly install in the brew directories, working together well with brewed python libs that aren’t installable with plain pip. This also means that pip by default will work without sudo as all of homebrew, so if you ever run or have to run sudo pip ... because of missing permissions, then you’re doing it wrong! Also, don’t be afraid of multiple pythons on your system: you have them anyhow (python2 and python3) and it’s an advantage, as we’ll make sure that nothing poisons your system python and that you as a user & developer will use the brewed python:

hash -r python  # invalidate bash's lookup cache for python
which python
# should say /usr/local/bin/python

echo $PATH
# /usr/local/bin should appear in front of /usr/bin

If this is not the case you’d probably end up not using brewed python. Please check your brew install with brew doctor, it will probably tell you that you should consider updating your paths in ~/.bashrc. You can either follow its directions or create a ~/.profile file like this one: ~/.profile. If you performed these steps, please close your terminal windows and open a new one for the changes to take effect. Test the above again.

Even if the above check worked, run the following anyhow and read through its output (no output is good):

brew doctor

Pay special attention if this tells you to install XQuartz, and if it does, install it! You’ll need it anyhow…

Now after these preparations, let’s really start installing stuff… below you’ll mostly find one package / lib per line. For each of them and for their possible options: they’re a recommendation that might save you some trouble, so i’d recommend to install all of them as i write here, even if specifying some of the options will compile brew packages from source and take a bit longer…

# install PIL, imagemagick, graphviz and other
# image generating stuff
brew install libtiff libjpeg webp little-cms2
pip install Pillow
brew install imagemagick --with-fftw --with-librsvg --with-x11
brew install graphviz --with-librsvg --with-x11
brew install cairo
brew install py2cairo # this will ask you to download xquartz and install it
brew install qt pyqt5

# install virtualenv, nose (unittests & doctests on steroids)
pip install virtualenv
pip install nose

# install numpy and scipy
# nowadays there are two good ways to install numpy and scipy: via pip or via brew.
# PICK ONE:
# - i prefer pip for proper virtualenv support and more up-to-date versions.
# - brew builds are a bit older, but handy if you need a source build
pip install numpy
pip install scipy
# OR:
# (if you want to run numpy and scipy with openblas also remove comments below:)
#brew install openblas
#brew install numpy # --with-openblas
#brew install scipy # --with-openblas

# test the numpy & scipy install
python -c 'import numpy ; numpy.test();'
python -c 'import scipy ; scipy.test();'

# some cool python libs (if you don't know them, look them up)
# matplotlib: generate plots
# pandas: time series stuff
# nltk: natural language toolkit
# sympy: symbolic maths in python
# q: fancy debugging output
# snakeviz: cool visualization of profiling output (aka what's taking so long?)
#brew install Caskroom/cask/mactex  # if you want to install matplotlib with tex support and don't have mactex installed already
brew install matplotlib --with-cairo --with-tex  # cairo: png ps pdf svg filetypes, tex: tex fonts & formula in plots
pip install pandas
pip install nltk
pip install sympy
pip install q
pip install snakeviz

# ipython/jupyter with parallel and notebook support
brew install zmq
pip install ipython[all]

# html stuff (parsing)
pip install html5lib cssselect pyquery lxml BeautifulSoup

# webapps / apis (choose what you like)
pip install Flask Django tornado

# semantic web stuff: rdf & sparql
pip install rdflib SPARQLWrapper

# graphs (graph metrics, social network analysis, layouting)
pip install networkx

# maintenance: updating of pip libs
pip list --outdated  # see Updating section below

Have fun 😉

Updating

OK, let’s say it’s been a while since you installed things with this guide and you now want to update all the installed libs. To do this you should first upgrade everything installed with brew like this:

brew update && brew outdated && brew upgrade

Afterwards for upgrading pip packages (globally or in a virtualenv) you can just run

pip list --outdated

to get a list of outdated packages and then manually update them with:

pip install -U package1 package2 ...

If you want a tiny bit more comfort you can use the pip-review package to do this interactively:

pip install pip-review

Once installed you should be able to run the following either in a virtualenv or globally for your whole system:

pip-review -i # for interactive mode, -a to upgrade all which is dangerous

It will check your installed packages for new versions and give you a list of outdated packages. I’d recommend to run it with the -i option to interactively install the upgrades.

A word of warning about the brewed packages: If i recommended to install a package with brew above that’s usually for a good reason like the pip version not working properly. If you’re a bit more advanced, you can try to upgrade them with pip, but i’d recommend to properly unlink them with brew unlink <package> before, as some pip packages might run into problems otherwise. If you find the pip package works like a charm then, please let me know in the comments below so i can update this guide. In general i prefer the pip packages as they’re more up to date, work in virtual environments and can then easily be updated with pip-review.

As always: If you liked this, think something is missing or wrong leave a comment.

Updates to this guide:

2014-03-02: include checking of $PATH for Mike
2015-03-17: enhanced many explanations, provided some useful options for packages, general workover
2015-04-15: included comment about installing mactex via cask if not there already, thanks to William
2015-06-05: Pillow via pip and Updating section
2015-11-01: pip-review (was detached from pip-tools) + alternative
2016-02-15: hash -r python (invalidate bash python bin lookup cache)
2016-12-21: updates for sierra, brew upgrade, python3 and some more comments
2017-03-30: updated pyqt’s package name to pyqt5

Scientific Python on Mac OS X 10.8 with homebrew

(newer version of this guide)

A step-by-step installation guide to setup a scientific python environment based on Mac OS X and homebrew.

Needless to say: Make a backup (Timemachine)

First install homebrew.
Follow their instructions, then come back here.

If you don’t have a clean install, some of the following steps might need minor additional attention (like changing permissions chmod, chown, chgrp or overwriting existing files in the linking step with brew link --overwrite package_that_failed. In this case i can only recommend a backup again).

In general: execute the following commands one line at a time and read the outputs! If you read some warnings about “keg-only” that’s fine, it just means that brew won’t “hide” your system’s stuff behind the stuff it installed itself so it doesn’t cause problems… brewed stuff will still use it.

# set up some taps and update brew
brew tap homebrew/science # a lot of cool formulae for scientific tools
brew tap homebrew/python # numpy, scipy
brew update && brew upgrade

# install a brewed python
brew install python

# install openblas (otherwise scipy's arpack tests will fail)
brew install openblas

# install PIL, imagemagick, graphviz and other
# image generating stuff (qt is nice for viewing)
brew install pillow imagemagick graphviz
brew install cairo --without-x
brew install py2cairo # this will ask you to download xquartz and install it
brew install qt pyqt

# install nose (unittests & doctests on steroids)
pip install virtualenv nose

# install numpy and scipy
brew install numpy --with-openblas # bug in Accelerate framework < Mac OS X 10.9
brew install scipy --with-openblas # bug in Accelerate framework < Mac OS X 10.9

# test the scipy install
brew test scipy

# some cool python libs (if you don't know them, look them up)
# time series stuff, natural language toolkit
# generate plots, symbolic maths in python, fancy debugging output
pip install pandas nltk matplotlib sympy q

# ipython and notebook support
brew install zmq
pip install ipython[zmq,qtconsole,notebook,test]

# html stuff (parsing)
pip install html5lib cssselect pyquery lxml BeautifulSoup

# webapps / apis (choose what you like)
pip install Flask Django

# semantic web stuff: rdf & sparql
pip install rdflib SPARQLWrapper

# picloud (easily run python scripts in the cloud)
pip install cloud

Have fun 😉

(As always: If you think something is missing leave a comment.)


update 2014-02-25: updated tap samualjohn/python to homebrew/python, new version linked

Python unicode doctest howto in a doctest

Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest.
So I thought the ultimate way to show how to do this would be in a doctest 😉

# -*- coding: utf-8 -*-

def testDocTestUnicode():
    ur"""Non ascii letters in doctests actually are tricky. The reason why
        things work here that usually don't (each marked with a #BAD!) is
        explained quite in the end of this doctest, but the essence is: we
        didn't only fix the encoding of this file, but also the
        sys.defaultencoding, which you should never do.

        This file has a utf8 input encoding, which python is informed about by
        the first line: # -*- coding: utf-8 -*-. This means that for example an
        ä is 2 bytes: 11000011 10100100 (hexval "c3a4").

        There are two types of strings in Python 2.x: "" aka byte strings and
        u"" aka unicode string. For these two types two different things happen
        when parsing a file:

        If python encounters a non ascii char in a byte string (e.g., "ä") it
        will check if there's an input encoding given (yes, utf8) and then check
        if the 2 bytes ä is a valid utf-8 encoded char (yes it is). It will then
        simply keep the ä as its 2 byte utf-8 encoding in this byte-string
        internal representation. If you print it and you're lucky to have a utf8
        console you'll see an ä again. If you're not lucky and for example have
        a iso-8859-15 encoding on your console you'll see 2 strange chars
        (probably À) instead. So python will simply write the byte-string to
        output.

        >>> print "ä" #BAD!
        ä

        If there was no encoding given, we'd get a SyntaxError: Non-ASCII
        character 'xc3' in file ..., which is the first byte of our 2 byte ä.
        Where did the 'xc3' come from? Well, this is python's way of writing a
        non ascii byte to ascii output (which is always safe, so perfect for
        this error message): it will write a x and then two hex chars for each
        byte. Python does the same if we call:

        >>> print repr("ä")
        'xc3xa4'

        Or just
        >>> "ä"
        'xc3xa4'

        It also works the other way around, so you can give an arbitrary byte by
        using the same xXX escape sequences:
        >>> print "xc3xa4" #BAD!
        ä

        Oh look, we hit the utf8 representation of an ä, what a luck. You'll ask
        how do I then print "xc3xa4" to my console? You can either double all
        "" or tell python it's a raw string:
        >>> print "\xc3\xa4"
        xc3xa4
        >>> print r"xc3xa4"
        xc3xa4



        If python encounters a unicode string in our document (e.g., u"ä") it
        will use the specified file encoding to convert our 2 byte utf8 ä into a
        unicode string. This is the same as calling "ä".decode(myFileEncoding):
        >>> print u"ä" # BAD for another reason!
        ä
        >>> u"ä"
        u'xe4'
        >>> "ä".decode("utf-8")
        u'xe4'

        Python's internal unicode representation of this string is never exposed
        to the user (it could be UTF-16 or 32 or anything else, anyone?).
        The hex e4 corresponds to 11100100, the unicode ord value of the char ä,
        which is decimal 228.
        >>> ord(u'ä')
        228

        And the same again backwards, we can use the xXX escaping to denote a
        hex unicode point or raw not to interpret such escaping:
        >>> print u"xe4"
        ä
        >>> print ur"xe4"
        xe4

        Oh, noticed the difference? This time print did some magic. I told
        you, you'll never see python's internal representation of a unicode
        string. So whenever print receives a unicode string it will try to
        convert it to your output encoding (sys.out.encoding), which works in a
        terminal, but won't work if you're for example redirecting output to a
        file. In such cases you have to convert the string into the desired
        encoding explicitly:
        >>> u"ä".encode("utf8")
        'xc3xa4'
        >>> print u"ä".encode("utf8") #BAD!
        ä

        If that last line confused you a bit: We converted the unicode string
        to a byte-string, which was then simply copied byte-wise by print and
        voila, we got an ä.



        This all is done before the string even reaches doctest.
        So you might have written something like all the above in doctests,
        and probably saw them failing. In most cases you probably just 
        forgot the ur'''prefix''', but sometimes you had it and were confused.
        Well this is good, as all of the above #BAD! examples don't make much sense.

        Bummer, right.

        The reason is: we made assumptions on the default encoding all over the
        place, which is not a thing you would ever want to do in production
        code. We did this by setting sys.setdefaultencoding("UTF-8")
        below. Without this you'll usually get unicode warnings like this one:
        "UnicodeWarning: Unicode equal comparison failed to convert both
        arguments to Unicode - interpreting them as being unequal".
        Just fire up a python interpreter (not pydev, as I noticed it seems to
        fiddle with the default setting).
        Try: u"ä" == "ä"
        You should get:
            __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both
                arguments to Unicode - interpreting them as being unequal
            False

        This actually is very good, as it warns you that you're comparing some
        byte-string from whatever location (could be a file) to a unicode string.
        Shall python guess the encoding? Silently? Probably a bad idea.

        Now if you do the following in your python interpreter:
            import sys
            reload(sys)
            sys.setdefaultencoding("utf8")
            u"ä" == "ä"
        You should get:
            True

        No wonder, you explicitly told python to interpret the "ä" as utf8
        encoded when nothing else specified.

        So what's the problem in our docstrings again? We had these bad
        examples:

        >>> print "ä" #BAD!
        ä
        >>> print "xc3xa4" #BAD!
        ä
        >>> print u"ä".encode("utf8") #BAD!
        ä

        Well, we're in a ur'''docstring''' here, so what doctest does is: it
        takes the part after >>> and exec(utes) it. There's one special feature
        of exec i wasn't aware of: if you pass a unicode string to it, it will
        revert the char back to utf-8:

        >>> exec u'print repr("ä")'
        'xc3xa4'
        >>> exec u'print repr("xe4")'
        'xc3xa4'

        This means that even though one might think that print "ä" in this
        unicode docstring will get print "xe4", it will print as if you wrote
        print "ä" outside of a unicode string, so as if you wrote print
        "xc3xa4". Let this twist your mind for a second. The doctest will
        execute as if there had been no conversion to a unicode string, which is
        what you want. But now comes the comparison. It will see what comes out
        of that and compare to the next line from this docstring, which now is a
        unicode "ä", so xe4. Hence we're now comparing u'xe4' == 'xc3xa4'.
        If you didn't notice, this is the same we did in the python interpreter
        above: we were comparing u"ä" == "ä". And again python tells us "Hmm,
        don't know shall I guess how to convert "ä" to u"ä"? Probably not, so
        evaluate to False.


        Summary:
        Always specify the source encoding: # -*- coding: utf-8 -*-
        and _ALWAYS_, no excuse, use utf-8. Repeat it: I will never use
        iso-8859-x, latin-1 or anything else, I'll use UTF-8 so I can write
        Jörn and he can actually read his name once.
        Use ur'''...''' surrounded docstrings (so a raw unicode docstring).
        You can also use ru'''...''', but I always think Russian strings?
        Never compare a unicode string with a byte string. This means: don't
        use u"ä" and "ä" mixed, they're not the same. Also the result line can
        only match unicode strings plain ascii, no other encoding.

        The following are bad comparisons, as they will compare byte- and
        unicode strings. They'll cause warnings and eval to false:
        #>>> u"ä" == "ä"
        #False
        #>>> "ä".decode("utf8") == "ä" 
        #False
        #>>> print "ä"
        #ä


        So finally a few working examples:  

        >>> "ä" # if file encoding is utf8
        'xc3xa4'
        >>> u"ä"
        u'xe4'

        Here both are unicode, so no problem, but nevertheless a bad idea to
        match output of print due to the print magic mentioned above and think
        about i18n: time formats, commas, dots, float precision, etc. 
        >>> print u"ä" # unicode even after exec, no prob.
        ä

        Better:
        >>> "ä" == "ä" # compares byte-strings
        True
        >>> u"ä".encode("utf8") == "ä" # compares byte-strings
        True
        >>> u"ä" == u"ä" # compares unicode-strings
        True
        >>> "ä".decode("utf8") == u"ä" # compares unicode-strings
        True
    """
    pass


if __name__ == "__main__":
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8") # DON'T DO THIS. READ THE ABOVE @UndefinedVariable
    import doctest
    doctest.testmod()

How to restrict the length of a unicode string

Ha, not with me!

It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize], as its UTF-8 representation could be up to 6 times as long.
So the next worse attempt could be this unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"): This could cut a multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily python will tell you by throwing a UnicodeDecodeError.

The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" flag:

unicode(myUnicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")

One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows Combined Characters, for example the precomposed u"ü" could be represented by the decomposed sequence u"u" and u"¨" (see Unicode Normalization).
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so. This means that if there is a precomposed char it will be used. Nevertheless Unicode potentially allows for Combined characters which do not have a precomposed canonical equivalent. In this case not even normalizing would help, multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars. In this case I’m unsure what’s the universal solution… for such a u”ü” is it better to have a u”u” or nothing in case of a split? You have to decide.
I decided for having an “u” in the hopefully very rare case this occurs.
So use the following with care:

def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
        string such that the corresponding utf-8 string won't exceed
        maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
        with the maxsize limit: either the whole char fits or it will be
        truncated completely. Make sure that unicodeStr is in Unicode
        Normalization Form C (NFC), else strange things can happen as
        mentioned in the examples below.
        Returns a unicode string, so if you need it encoded as utf-8, call
        .decode("utf-8") after calling this method.
        >>> truncateUTF8lengthIfNecessary(u"ö", 2) == (u"ö", False)
        True
        >>> truncateUTF8length(u"ö", 1) == u""
        True
        >>> u'u1ebf'.encode('utf-8') == 'xe1xbaxbf'
        True
        >>> truncateUTF8length(u'hiu1ebf', 2) == u"hi"
        True
        >>> truncateUTF8lengthIfNecessary(u'hiu1ebf', 3) == (u"hi", True)
        True
        >>> truncateUTF8length(u'hiu1ebf', 4) == u"hi"
        True
        >>> truncateUTF8length(u'hiu1ebf', 5) == u"hiu1ebf"
        True

        Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
        The following would not be true, as e and u'u0301' would be seperate
        unicode chars. This could be handled with unicodedata.combining
        and a loop deleting chars from the end until after the first non
        combining char, but this is _not_ done here!
        #>>> u'eu0301'.encode('utf-8') == 'exccx81'
        #True
        #>>> truncateUTF8length(u'eu0301', 0) == u"" # not in NFC (u'xe9'), but in NFD
        #True
        #>>> truncateUTF8length(u'eu0301', 1) == u"" #decodes to utf-8: 
        #True
        #>>> truncateUTF8length(u'eu0301', 2) == u""
        #True
        #>>> truncateUTF8length(u'eu0301', 3) == u"eu0301"
        #True
        """
    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")

Unicode and UTF-8 is nice, but if you don’t pay attention it will cause your code to contain a lot of sleeping bugs. And yes, probably I’d care less if there was no “ö” in my name 😉

PS: Günther, this is SFW. :p

How to convert hex strings to binary ascii strings in python (incl. 8bit space)

As this comes across may way again and again:

How do you turn a hex string like "c3a4c3b6c3bc" into a nice binary string like this: "11000011 10100100 11000011 10110110 11000011 10111100"?

The solution is based on the Python 2.6 new string formatting:

>>> "{0:8b}".format(int("c3",16))
'11000011'

Which can be decomposed into 4 bit for each hex char like this: (notice the 04b, which means 0-padded 4chars long binary string):

>>> "{0:04b}".format(int("c",16)) + "{0:04b}".format(int("3",16))
'11000011'

OK, now we could easily do this for all hex chars "".join(["{0:04b}".format(int(c,16)) for c in "c3a4c3b6"]) and done, but usually we want a blank every 8 bits from the right to left… And looping from the right pairwise is a bit more complicated… Oh and what if the number of bits is uneven?
So the solution looks like this:

>>> binary = lambda x: " ".join(reversed( [i+j for i,j in zip( *[ ["{0:04b}".format(int(c,16)) for c in reversed("0"+x)][n::2] for n in [1,0] ] ) ] ))
>>> binary("c3a4c3b6c3bc")
'11000011 10100100 11000011 10110110 11000011 10111100'

It takes the hex string x, first of all concatenates a "0" to the left (for the uneven case), then reverses the string, converts every char into a 4-bit binary string, then collects all uneven indices of this list, zips them to all even indices, for each in the pairs-list concatenates them to 8-bit binary strings, reverses again and joins them together with a ” ” in between. In case of an even number the added 0 falls out, because there’s no one to zip with, if uneven it zips with the first hex-char.

Yupp, I like 1liners 😉

Update: Btw, it’s very easy to combine this with binascii.hexlify to get the binary representation of some byte-string:

>>> import binascii
>>> binascii.hexlify('jörn')
'6ac3b6726e'
>>> binary(binascii.hexlify('jörn'))
'01101010 11000011 10110110 01110010 01101110'

(URL)Encoding in python

Well, encodings are a never ending story and whenever you don’t want to waste time on them, it’s for sure that you’ll stumble over yet another tripwire. This time it is the encoding of URLs (note: even though related I’m not talking about the urlencode function). Perhaps you have seen something like this before:
http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der which actually is the URI pendant to this IRI: http://de.wikipedia.org/wiki/Gehard_Schröder

Now what’s the problem, you might ask. The problem is that two things can happen here:
Either your browser (or the library you use) thinks: “hmm, this 'ö' is strange, let’s convert it into a '%C3%B6'” or your browser (or lib) doesn’t care and asks the server with the 'ö' in the URL, introducing a bit of non-determinism into your expectations, right?

More details here:

$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schröder
HTTP/1.0 200 OK
Date: Thu, 22 Jul 2010 09:41:56 GMT
...
Last-Modified: Wed, 21 Jul 2010 11:50:31 GMT
Content-Length: 144996
...
Connection: close
$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der
HTTP/1.0 200 OK
Date: Sat, 31 Jul 2010 00:24:47 GMT
...
Last-Modified: Thu, 29 Jul 2010 10:04:31 GMT
Content-Length: 144962
...
Connection: close

Notice how the Date, Last-Modified and Content-Length differ.

OK, so how do we deal with this? I’d say: let’s always ask for the “percentified” version… but before try to understand this:

# notice that my locale is en.UTF-8
>>> print "jörn"
jörn
>>> "jörn" # implicitly calls: print repr("jörn")
'jxc3xb6rn'
>>> print repr("jörn")
'jxc3xb6rn'
>>> u"jörn"
u'jxf6rn'
>>> print u"jörn"
jörn
>>> print u"jörn".encode("utf8")
jörn
>>> u"jörn".encode("utf8")
'jxc3xb6rn'
>>> "jörn".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
'jxc3xb6rn'.decode("utf8")
u'jxf6rn'

So, what happened here?
As my locale is set to use UTF-8 encoding, all my inputs are utf-8 encoded already.
If until now you might have wondered, why 'ö' is translated into '%C3%B6', you might have spotted that 'ö' corresponds to the utf-8 "xc3xb6", which actually is python’s in string escape sequence for non-ASCII chars: it refers to 2 bytes with the hex-code: c3b6 (binary: '11000011 10110110') (quite useful: "{0:b} {1:b}".format(int("c3", 16), int("b6",16))).
So in URLs these "xhh" are simply replaced by "%HH", so a percent and two uppercase ASCII-Chars indicating a hex-code. The unicode 'ö' (1 char, 1byte, unicode "xf6" ('11110110')) hence is first transformed into utf-8 (1char, 2byte, utf8: '11000011 10110110') by my OS, before entering it into python, internally kept in this form unless I use the u"" strings, and then represented in the URL with "%C3%B6" (6chars, 6byte, ASCII).
What this example also shows is the implicit print repr(var) performed by the interactive python interpreter when you simply enter some var and hit return.
Print will try to convert strings to the current locale if they’re Unicode-Strings (u""). Else python will not assume that the string has any specific encoding, but just stick with the encoding your OS chose. It will simply treat the string as it was received and write the byte-sequence to your sys.stdout.

So back to the manual quoting of URLs:

>>> import urllib as ul
>>> ul.quote("jörn")
'j%C3%B6rn'
>>> print ul.quote("jörn")
j%C3%B6rn

>>> ul.unquote('j%C3%B6rn')
'jxc3xb6rn'
>>> ul.unquote("jörn")
'jxc3xb6rn'
>>> print ul.unquote("jörn")
jörn

Itertools

Just recently came across the python itertools “tools for efficient looping” again. Generators have the advantage of not creating the whole list on definition, but on demand (in contrast to e.g., list comprehensions). Really worth a look:

import itertools as it
g = it.cycle("abc") # a generator
g.next() # a
g.next() # b
g.next() # c
g.next() # a
g.next() # b
# ... and so on

g = it.cycle("abcde")
h = it.cycle("1234")
gh = it.izip(g,h) # iterzips n iterables together
gh.next() # (a,1)
gh.next() # (b,2)
# ... think about what this means with primes
gh.next() # (e,4)
gh.next() # (a,1)
# ...

Also very nice are the combinatoric generators:

it.product('ABCD',  repeat=2) # AA AB AC AD BA BB BC BD
                              # CA CB CC CD DA DB DC DD
it.permutations('ABCD',  2)   # AB AC AD BA BC BD CA CB CD DA DB DC
it.combinations('ABCD',  2)   # AB AC AD BC BD CD
it.combinations_with_replacement('ABCD',  2) # AA AB AC AD BB BC BD CC CD DD

Precision-Recall diagrams including F-Measure height lines

Today I was asked how to generate Recall-Precision diagrams including the f-measure values as height-lines from within python. Actually Gunnar was the one who had this idea quite a while ago, but constantly writing things into files, then loading them with his R code to visualize them, made me create a “pythonified” version. Looks like this (click for large version):

Here’s my python code-snippet for it: recallPrecision.py.

Uses a bunch of pylab internally, but after simply importing this module, you can easily visualize a list of (precision, recall) points like this:

import scipy as sc
import pylab as pl
import recallPrecision as rp
prs = sc.rand(15,2) # precision recall point list
labels = ["item " + str(i) for i in range(15)] # labels for the points
rp.plotPrecisionRecallDiagram("footitle", prs, labels)
pl.show()

(updated on 2014-10-28: link to gist)

Sort python dictionaries by values

Perhaps you already encountered a problem like the following one yourself:
You have a large list of items (let’s say URIs for this example) and want to sum up how often they were viewed (or edited or… whatever). A small one-shot solution in python looks like the following and uses the often unknown operator.itemgetter:

import sys
import operator
uriViews = {}
for line in sys.stdin:
    uri, views = line.strip().split()
    views = int(views)
    uriViews[uri] = uriViews.get(uri, 0) - views
    # why minus? could do + and use reverse=True below?
    # right, but then items also get sorted in reverse, if views are
    # the same (so Z would come first, a last)
for uri, views in sorted(uriViews.items(),
                         key=operator.itemgetter(1,0)):
    print -views, uri

This approach can be a lot faster than self written lambda functions called for every comparison or a list comprehension which turns around all tuples and then sorts it. Also in contrast to many other solutions you can find online this one uses operator.itemgetter(1,0), which means that if two items have the same amount of views, their URIs are sorted alphabetically.

Remember that this approach sorts the whole list in RAM, so you might want to chose other approaches in case your lists are getting huge.
For further information you might want to read PEP-0265, which also includes a hint what to do when you’re only interested in the Top 1000 for example (will only sort these top1000):

import heapq
top1000 = heapq.nlargest(1000, uriViews.iteritems(), itemgetter(1,0))
for uri,views in top1000:
   print -views, uri