EtherPad: live collaborative text editing

Ever thought that it would be cool to just collaborate with others while writing a document? Well ok, there are wikis but I mean real-time. Not that it’s new or anything (google docs, wave), but EtherPad recently became one of my favorites for this (thanks to Andreas Wagner).

It’s a no login website, you just need the URL and can start. You can host EtherPad yourself if you don’t trust the server.

It’s a great tool for brainstorming in a group (no more poor guy having to log everything, simply add it yourself), writing down some thoughts, coordinating things in a very interactive way.

Mozilla put up a public EtherPad, just try it here and don’t miss the cool time-slider.

Bash prompt indicating return value

Lately I’ve fiddled a lot with installing virtuoso on some virtual machines and found myself repeatedly asking bash for the return value of the last command echo $?. I remembered this blog post by Gecko quite a while ago and tuned it a bit to my needs. My user prompt now looks like this (it’s a slightly modified version of the old gentoo defaults, that I prefer over the ubuntu defaults, which only remind you that you’re root with a # instead of a $):

The code for the user prompt:

PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m]u@h\[\033[00m]:\[\033[01;34m]w $(if [[ $? = 0 ]];then echo "\[\033[01;34m]";else echo "\[\033[01;31m]";fi)$ \[\033[00m]'

And the root prompt:

PS1='\[\033[01;31m]h\[\033[01;34m] W $(if [[ $? = 0 ]];then echo "\[\033[01;34m]";else echo "\[\033[01;31m]";fi)$\[\033[00m] '

And a small doc snippet you might want to include with your PS1 in your .bashrc, so you can still understand that cryptic stuff in a few days:

# h hostname
# u user
# w working dir (home == ~)
# $ a $ if UID == 0 else #
# A current 24-time: HH:MM
# \ a 
# [ begin, ] end of non-printing (control chars)
#
# colors: have to be surrounded by: '\[\033[' and 'm]' (without the ''. These literally are the left and right bracket!)
#  color codes are as follows, preceeded by a '0;' (dark) (default) or a '1;' (light)
#  FG and BG NULL: 00 resets
#  FG: 30 + ... 0: black, 1: red, 2: green, 3: yellow, 4: blue, 5: violet, 6: cyan, 7: white
#  BG: 40 + ... same

P != NP solved?

Well, it happens every now and then, that someone finds a proof for a long standing theoretical problem.
This time it’s the famous P = NP or P != NP problem of computer science that Gecko pointed me to:

Let’s wish him luck that his proof is flawless and accepted.
(Who’s the first one to say: “I’ve always known it” ? :D)

Sadly my wishes didn’t help :-/

WTFPL

Thanks to Ralf I came across this really nice license today. If you’ve ever been lost in the licensing jungle of any software, you’ll understand:
The WTFPL (Homepage includes a nice FAQ section as well ;)):

            DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE 
                    Version 2, December 2004 

 Copyright (C) 2004 Sam Hocevar <sam@hocevar.net> 

 Everyone is permitted to copy and distribute verbatim or modified 
 copies of this license document, and changing it is allowed as long 
 as the name is changed. 

            DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE 
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 

  0. You just DO WHAT THE FUCK YOU WANT TO.

(URL)Encoding in python

Well, encodings are a never ending story and whenever you don’t want to waste time on them, it’s for sure that you’ll stumble over yet another tripwire. This time it is the encoding of URLs (note: even though related I’m not talking about the urlencode function). Perhaps you have seen something like this before:
http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der which actually is the URI pendant to this IRI: http://de.wikipedia.org/wiki/Gehard_Schröder

Now what’s the problem, you might ask. The problem is that two things can happen here:
Either your browser (or the library you use) thinks: “hmm, this 'ö' is strange, let’s convert it into a '%C3%B6'” or your browser (or lib) doesn’t care and asks the server with the 'ö' in the URL, introducing a bit of non-determinism into your expectations, right?

More details here:

$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schröder
HTTP/1.0 200 OK
Date: Thu, 22 Jul 2010 09:41:56 GMT
...
Last-Modified: Wed, 21 Jul 2010 11:50:31 GMT
Content-Length: 144996
...
Connection: close
$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der
HTTP/1.0 200 OK
Date: Sat, 31 Jul 2010 00:24:47 GMT
...
Last-Modified: Thu, 29 Jul 2010 10:04:31 GMT
Content-Length: 144962
...
Connection: close

Notice how the Date, Last-Modified and Content-Length differ.

OK, so how do we deal with this? I’d say: let’s always ask for the “percentified” version… but before try to understand this:

# notice that my locale is en.UTF-8
>>> print "jörn"
jörn
>>> "jörn" # implicitly calls: print repr("jörn")
'jxc3xb6rn'
>>> print repr("jörn")
'jxc3xb6rn'
>>> u"jörn"
u'jxf6rn'
>>> print u"jörn"
jörn
>>> print u"jörn".encode("utf8")
jörn
>>> u"jörn".encode("utf8")
'jxc3xb6rn'
>>> "jörn".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
'jxc3xb6rn'.decode("utf8")
u'jxf6rn'

So, what happened here?
As my locale is set to use UTF-8 encoding, all my inputs are utf-8 encoded already.
If until now you might have wondered, why 'ö' is translated into '%C3%B6', you might have spotted that 'ö' corresponds to the utf-8 "xc3xb6", which actually is python’s in string escape sequence for non-ASCII chars: it refers to 2 bytes with the hex-code: c3b6 (binary: '11000011 10110110') (quite useful: "{0:b} {1:b}".format(int("c3", 16), int("b6",16))).
So in URLs these "xhh" are simply replaced by "%HH", so a percent and two uppercase ASCII-Chars indicating a hex-code. The unicode 'ö' (1 char, 1byte, unicode "xf6" ('11110110')) hence is first transformed into utf-8 (1char, 2byte, utf8: '11000011 10110110') by my OS, before entering it into python, internally kept in this form unless I use the u"" strings, and then represented in the URL with "%C3%B6" (6chars, 6byte, ASCII).
What this example also shows is the implicit print repr(var) performed by the interactive python interpreter when you simply enter some var and hit return.
Print will try to convert strings to the current locale if they’re Unicode-Strings (u""). Else python will not assume that the string has any specific encoding, but just stick with the encoding your OS chose. It will simply treat the string as it was received and write the byte-sequence to your sys.stdout.

So back to the manual quoting of URLs:

>>> import urllib as ul
>>> ul.quote("jörn")
'j%C3%B6rn'
>>> print ul.quote("jörn")
j%C3%B6rn

>>> ul.unquote('j%C3%B6rn')
'jxc3xb6rn'
>>> ul.unquote("jörn")
'jxc3xb6rn'
>>> print ul.unquote("jörn")
jörn

Itertools

Just recently came across the python itertools “tools for efficient looping” again. Generators have the advantage of not creating the whole list on definition, but on demand (in contrast to e.g., list comprehensions). Really worth a look:

import itertools as it
g = it.cycle("abc") # a generator
g.next() # a
g.next() # b
g.next() # c
g.next() # a
g.next() # b
# ... and so on

g = it.cycle("abcde")
h = it.cycle("1234")
gh = it.izip(g,h) # iterzips n iterables together
gh.next() # (a,1)
gh.next() # (b,2)
# ... think about what this means with primes
gh.next() # (e,4)
gh.next() # (a,1)
# ...

Also very nice are the combinatoric generators:

it.product('ABCD',  repeat=2) # AA AB AC AD BA BB BC BD
                              # CA CB CC CD DA DB DC DD
it.permutations('ABCD',  2)   # AB AC AD BA BC BD CA CB CD DA DB DC
it.combinations('ABCD',  2)   # AB AC AD BC BD CD
it.combinations_with_replacement('ABCD',  2) # AA AB AC AD BB BC BD CC CD DD

Precision-Recall diagrams including F-Measure height lines

Today I was asked how to generate Recall-Precision diagrams including the f-measure values as height-lines from within python. Actually Gunnar was the one who had this idea quite a while ago, but constantly writing things into files, then loading them with his R code to visualize them, made me create a “pythonified” version. Looks like this (click for large version):

Here’s my python code-snippet for it: recallPrecision.py.

https://gist.github.com/joernhees/c58a4ca92faf3e4c5331

Uses a bunch of pylab internally, but after simply importing this module, you can easily visualize a list of (precision, recall) points like this:

import scipy as sc
import pylab as pl
import recallPrecision as rp
prs = sc.rand(15,2) # precision recall point list
labels = ["item " + str(i) for i in range(15)] # labels for the points
rp.plotPrecisionRecallDiagram("footitle", prs, labels)
pl.show()

(updated on 2014-10-28: link to gist)

Sort python dictionaries by values

Perhaps you already encountered a problem like the following one yourself:
You have a large list of items (let’s say URIs for this example) and want to sum up how often they were viewed (or edited or… whatever). A small one-shot solution in python looks like the following and uses the often unknown operator.itemgetter:

import sys
import operator
uriViews = {}
for line in sys.stdin:
    uri, views = line.strip().split()
    views = int(views)
    uriViews[uri] = uriViews.get(uri, 0) - views
    # why minus? could do + and use reverse=True below?
    # right, but then items also get sorted in reverse, if views are
    # the same (so Z would come first, a last)
for uri, views in sorted(uriViews.items(),
                         key=operator.itemgetter(1,0)):
    print -views, uri

This approach can be a lot faster than self written lambda functions called for every comparison or a list comprehension which turns around all tuples and then sorts it. Also in contrast to many other solutions you can find online this one uses operator.itemgetter(1,0), which means that if two items have the same amount of views, their URIs are sorted alphabetically.

Remember that this approach sorts the whole list in RAM, so you might want to chose other approaches in case your lists are getting huge.
For further information you might want to read PEP-0265, which also includes a hint what to do when you’re only interested in the Top 1000 for example (will only sort these top1000):

import heapq
top1000 = heapq.nlargest(1000, uriViews.iteritems(), itemgetter(1,0))
for uri,views in top1000:
   print -views, uri