Another one of Luis von Ahn’s ingenious projects: http://duolingo.com. Learn a language for free and translate the web in the background.
There is a pretty recent TED talk by him, and below you can find their introductory video on YouTube:
Duolingo: Learn a language and translate the web
Interesting talk about “Filter Bubbles”
A few days ago I stumbled over an interesting TED talk by Eli Pariser about the ever-increasing personalization of the web, its search results, your Facebook news feed, … Do you think that you still see the whole picture or are you already caught in your own filtered information bubble? (thx to Kingsley Idehen)
Mac OS X Harddisk high Load Cycle Counts
Short summary: Mac OS X’s default power management settings might wear your hard drive down unnecessarily. This post provides a lot of background information and how to change these settings.
Live mapping of tweets, facebook msgs, emails, sms…
Reading the Wikimedia blog I stumbled over this interesting post. They mention a framework called Ushahidi (Swahili word for “testimony”) with its subproject SwiftRiver, which can be used to track and verify the reliability of news concerning current trending topics, possibly helping editors of Wikipedia to enhance the quality.
Digging into it, I found out the framework is used for live mapping (collection, aggregation and visualization) of disaster and event related messages sent via all different kinds of transports (e.g., twitter, facebook, email, sms…). One example is the 2010 Haiti earthquake, where it helped to coordinate all the search and rescue teams.
As I find it quite fascinating how much people who sit at home in their living rooms might be able to help others in a disaster region, I’d like to suggest this talk:
LaTeX Thesis Skeleton
As it might be useful for other students (especially for computer science students at the University of Kaiserslautern), I decided to invest some time and create a skeleton for a thesis.
The project can be found on github: http://github.com/joernhees/thesis-skeleton.
I’ll happily include / pull changes.
Quick instructions to get started with your thesis:
- Make sure you have git, otherwise install it (e.g., on ubuntu: sudo aptitude install git-core)
- Run this: git clone git://github.com/joernhees/thesis-skeleton.git myMasterThesis
  It will create a directory called myMasterThesis in the current directory which actually is a git repository and includes a thesis directory.
- Enter it and have a look at thesis.pdf
- Insert your name, title, supervisors, etc. in thesis.tex.
- Get familiar with git, this is a good start.
That’s it.
Interesting analysis of a post’s life cycle
Corte.si did it again.
This time a very interesting analysis of what happens when he posts on his blog and tweets about it.
Most interesting: the number of bots that access his page just seconds after he published it, where the bulk of human readers came from, and how it changed his number of subscribers.
BetterRelations (beta): some updates
Well, in a hopefully last coding “flash” this night I included some frequently requested features, most important: a “can’t decide” button:
Enjoy
(also see the first post)
Introducing: BetterRelations – a Game with a Purpose
As many of you know I’m developing a game called BetterRelations for my MasterThesis. It is now available:
BetterRelations (alpha)
The game collects pairwise user preferences, which are then used to rate Linked Data triples by “Importance”. It would be cool if you found time to play the game, maybe in the lunch break, and helped me collect the data for my thesis.
Feedback and bug reports are heartily welcome. If you know other interested players feel free to forward the link or this post, the more people, the better
More to come, keep posted.
Python unicode doctest howto in a doctest
Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest.
So I thought the ultimate way to show how to do this would be in a doctest
def testDocTestUnicode():
    ur"""Non ascii letters in doctests actually are tricky. The reason why
things work here that usually don't (each marked with a #BAD!) is
explained quite in the end of this doctest, but the essence is: we
didn't only fix the encoding of this file, but also the
sys.defaultencoding, which you should never do.
This file has a utf8 input encoding, which python is informed about by
the first line: # -*- coding: utf-8 -*-. This means that for example an
ä is 2 bytes: 11000011 10100100 (hexval "c3a4").
There are two types of strings in Python 2.x: "" aka byte strings and
u"" aka unicode string. For these two types two different things happen
when parsing a file:
If python encounters a non ascii char in a byte string (e.g., "ä") it
will check if there's an input encoding given (yes, utf8) and then check
if the 2 bytes ä is a valid utf-8 encoded char (yes it is). It will then
simply keep the ä as its 2 byte utf-8 encoding in this byte-string
internal representation. If you print it and you're lucky to have a utf8
console you'll see an ä again. If you're not lucky and for example have
a iso-8859-15 encoding on your console you'll see 2 strange chars
(probably Ã€) instead. So python will simply write the byte-string to
output.
>>> print "ä" #BAD!
ä
If there was no encoding given, we'd get a SyntaxError: Non-ASCII
character '\xc3' in file ..., which is the first byte of our 2 byte ä.
Where did the '\xc3' come from? Well, this is python's way of writing a
non ascii byte to ascii output (which is always safe, so perfect for
this error message): it will write a \x and then two hex chars for each
byte. Python does the same if we call:
>>> print repr("ä")
'\xc3\xa4'
Or just
>>> "ä"
'\xc3\xa4'
It also works the other way around, so you can give an arbitrary byte by
using the same \xXX escape sequences:
>>> print "\xc3\xa4" #BAD!
ä
Oh look, we hit the utf8 representation of an ä, what a luck. You'll ask
how do I then print "\xc3\xa4" to my console? You can either double all
"\" or tell python it's a raw string:
>>> print "\\xc3\\xa4"
\xc3\xa4
>>> print r"\xc3\xa4"
\xc3\xa4
If python encounters a unicode string in our document (e.g., u"ä") it
will use the specified file encoding to convert our 2 byte utf8 ä into a
unicode string. This is the same as calling "ä".decode(myFileEncoding):
>>> print u"ä" # BAD for another reason!
ä
>>> u"ä"
u'\xe4'
>>> "ä".decode("utf-8")
u'\xe4'
Python's internal unicode representation of this string is never exposed
to the user (it could be UTF-16 or 32 or anything else, anyone?).
The hex e4 corresponds to 11100100, the unicode ord value of the char ä,
which is decimal 228.
>>> ord(u'ä')
228
And the same again backwards, we can use the \xXX escaping to denote a
hex unicode point or raw not to interpret such escaping:
>>> print u"\xe4"
ä
>>> print ur"\xe4"
\xe4
Oh, noticed the difference? This time print did some magic. I told
you, you'll never see python's internal representation of a unicode
string. So whenever print receives a unicode string it will try to
convert it to your output encoding (sys.stdout.encoding), which works in a
terminal, but won't work if you're for example redirecting output to a
file. In such cases you have to convert the string into the desired
encoding explicitly:
>>> u"ä".encode("utf8")
'\xc3\xa4'
>>> print u"ä".encode("utf8") #BAD!
ä
If that last line confused you a bit: We converted the unicode string
to a byte-string, which was then simply copied byte-wise by print and
voila, we got an ä.
This all is done before the string even reaches doctest.
So you might have written something like all the above in doctests,
and probably saw them failing. In most cases you probably just
forgot the ur'''prefix''', but sometimes you had it and were confused.
Well this is good, as all of the above #BAD! examples don't make much sense.
Bummer, right.
The reason is: we made assumptions on the default encoding all over the
place, which is not a thing you would ever want to do in production
code. We did this by setting sys.setdefaultencoding("UTF-8")
below. Without this you'll usually get unicode warnings like this one:
"UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal".
Just fire up a python interpreter (not pydev, as I noticed it seems to
fiddle with the default setting).
Try: u"ä" == "ä"
You should get:
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both
arguments to Unicode - interpreting them as being unequal
False
This actually is very good, as it warns you that you're comparing some
byte-string from whatever location (could be a file) to a unicode string.
Shall python guess the encoding? Silently? Probably a bad idea.
Now if you do the following in your python interpreter:
import sys
reload(sys)
sys.setdefaultencoding("utf8")
u"ä" == "ä"
You should get:
True
No wonder, you explicitly told python to interpret the "ä" as utf8
encoded when nothing else specified.
So what's the problem in our docstrings again? We had these bad
examples:
>>> print "ä" #BAD!
ä
>>> print "\xc3\xa4" #BAD!
ä
>>> print u"ä".encode("utf8") #BAD!
ä
Well, we're in a ur'''docstring''' here, so what doctest does is: it
takes the part after >>> and exec(utes) it. There's one special feature
of exec I wasn't aware of: if you pass a unicode string to it, it will
revert the char back to utf-8:
>>> exec u'print repr("ä")'
'\xc3\xa4'
>>> exec u'print repr("\xe4")'
'\xc3\xa4'
This means that even though one might think that print "ä" in this
unicode docstring will get print "\xe4", it will print as if you wrote
print "ä" outside of a unicode string, so as if you wrote print
"\xc3\xa4". Let this twist your mind for a second. The doctest will
execute as if there had been no conversion to a unicode string, which is
what you want. But now comes the comparison. It will see what comes out
of that and compare to the next line from this docstring, which now is a
unicode "ä", so \xe4. Hence we're now comparing u'\xe4' == '\xc3\xa4'.
If you didn't notice, this is the same we did in the python interpreter
above: we were comparing u"ä" == "ä". And again python tells us: Hmm,
don't know, shall I guess how to convert "ä" to u"ä"? Probably not, so
evaluate to False.
Summary:
Always specify the source encoding: # -*- coding: utf-8 -*-
and _ALWAYS_, no excuse, use utf-8. Repeat it: I will never use
iso-8859-x, latin-1 or anything else, I'll use UTF-8 so I can write
Jörn and he can actually read his name once.
Use ur'''...''' surrounded docstrings (so a raw unicode docstring).
You can also use ru'''...''', but I always think Russian strings?
Never compare a unicode string with a byte string. This means: don't
use u"ä" and "ä" mixed, they're not the same. Also the result line can
only match unicode strings or plain ascii, no other encoding.
The following are bad comparisons, as they will compare byte- and
unicode strings. They'll cause warnings and eval to false:
#>>> u"ä" == "ä"
#False
#>>> "ä".decode("utf8") == "ä"
#False
#>>> print "ä"
#ä
So finally a few working examples:
>>> "ä" # if file encoding is utf8
'\xc3\xa4'
>>> u"ä"
u'\xe4'
Here both are unicode, so no problem, but nevertheless a bad idea to
match output of print due to the print magic mentioned above and think
about i18n: time formats, commas, dots, float precision, etc.
>>> print u"ä" # unicode even after exec, no prob.
ä
Better:
>>> "ä" == "ä" # compares byte-strings
True
>>> u"ä".encode("utf8") == "ä" # compares byte-strings
True
>>> u"ä" == u"ä" # compares unicode-strings
True
>>> "ä".decode("utf8") == u"ä" # compares unicode-strings
True
"""
    pass

if __name__ == "__main__":
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8") # DON'T DO THIS. READ THE ABOVE @UndefinedVariable
    import doctest
    doctest.testmod()
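For readers on Python 3, the same byte versus unicode distinction can be sketched as follows. This is only an illustration and not part of the doctest above: it assumes Python 3, where str is always a unicode string and bytes takes the role of the Python 2 byte string. The variable names are made up for this example.

```python
# Python 3 sketch (assumption: the doctest above targets Python 2).
text = "\u00e4"                      # the character ä as a unicode string
utf8_bytes = text.encode("utf-8")    # its 2 byte UTF-8 representation

print(repr(utf8_bytes))              # shows the two bytes c3 a4 in \x notation
print(ord(text))                     # the unicode code point as a decimal number

# Decoding with the right encoding gives back the original string ...
round_trip = utf8_bytes.decode("utf-8")

# ... decoding with the wrong one produces the "2 strange chars" effect.
mojibake = utf8_bytes.decode("iso-8859-15")

# A unicode string and a byte string never compare equal: Python 3 does not guess.
mixed_equal = (text == utf8_bytes)
```

The point carries over unchanged: decide explicitly where bytes become unicode and back, and never compare the two kinds of strings with each other.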
How to restrict the length of a unicode string
Ha, not with me!
It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize], as its UTF-8 representation could be up to 4 times as long (UTF-8 needs up to 4 bytes per code point).
The next attempt could be unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"). This could cut a multi-byte UTF-8 representation of a code point in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily Python will tell you by raising a UnicodeDecodeError.
The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" flag: unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
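A minimal sketch of the strict and the forgiving attempt, assuming Python 3 (this post itself uses Python 2). The helper name truncate_utf8 is made up for this illustration:

```python
def truncate_utf8(text, maxsize):
    """Cut text so that its UTF-8 encoding has at most maxsize bytes."""
    encoded = text.encode("utf-8")[:maxsize]
    # errors="ignore" silently drops an incomplete multi-byte sequence at the cut
    return encoded.decode("utf-8", errors="ignore")


# Without the flag, cutting "jörn" after 2 bytes splits the 2 byte "ö" in half
# and the strict decoder complains.
try:
    "j\u00f6rn".encode("utf-8")[:2].decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)
print(repr(truncate_utf8("j\u00f6rn", 2)))
print(repr(truncate_utf8("j\u00f6rn", 3)))
```

The result is a unicode string again, so encode it once more if the bytes themselves are needed.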
One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows combining characters, for example the precomposed u"ü" could also be represented by the decomposed sequence u"u" followed by the combining diaeresis u"\u0308" (see Unicode Normalization).
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so). This means that if there is a precomposed char it will be used. Nevertheless Unicode potentially allows for combining character sequences which do not have a precomposed canonical equivalent. In this case not even normalizing would help: multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars. Here I’m unsure what the universal solution is. For such a u"ü", is it better to have a u"u" or nothing in case of a split? You have to decide.
I decided to keep the "u" in the hopefully very rare case this occurs.
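To make the normalization issue concrete, here is a small sketch, again assuming Python 3. The choice of "q" plus combining diaeresis as a sequence without a precomposed equivalent is my assumption for illustration, and the variable names are made up:

```python
import unicodedata

# "u" followed by COMBINING DIAERESIS (U+0308): the decomposed form of "ü"
decomposed = "u\u0308"
# NFC replaces the pair by the precomposed character U+00FC
composed = unicodedata.normalize("NFC", decomposed)

# Assumed example of a sequence without a precomposed equivalent:
# NFC leaves "q" + COMBINING DIAERESIS as two separate code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0308")

# Byte-wise truncation in the middle of such a sequence keeps only the base char.
cut = no_precomposed.encode("utf-8")[:2].decode("utf-8", errors="ignore")

print(len(decomposed), len(composed), len(no_precomposed), repr(cut))
```

So normalizing to NFC removes the problem for the common cases, while for the remaining sequences a cut can still separate a base character from its combining mark.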
So use the following with care:
def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
    string such that the corresponding utf-8 string won't exceed
    maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
    with the maxsize limit: either the whole char fits or it will be
    truncated completely. Make sure that unicodeStr is in Unicode
    Normalization Form C (NFC), else strange things can happen as
    mentioned in the examples below.
    Returns a unicode string, so if you need it encoded as utf-8, call
    .encode("utf-8") after calling this method.

    >>> truncateUTF8lengthIfNecessary(u"ö", 2) == (u"ö", False)
    True
    >>> truncateUTF8length(u"ö", 1) == u""
    True
    >>> u'\u1ebf'.encode('utf-8') == '\xe1\xba\xbf'
    True
    >>> truncateUTF8length(u'hi\u1ebf', 2) == u"hi"
    True
    >>> truncateUTF8lengthIfNecessary(u'hi\u1ebf', 3) == (u"hi", True)
    True
    >>> truncateUTF8length(u'hi\u1ebf', 4) == u"hi"
    True
    >>> truncateUTF8length(u'hi\u1ebf', 5) == u"hi\u1ebf"
    True

    Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
    The following would not be true, as e and u'\u0301' would be separate
    unicode chars. This could be handled with unicodedata.combining
    and a loop deleting chars from the end until after the first non
    combining char, but this is _not_ done here!

    #>>> u'e\u0301'.encode('utf-8') == 'e\xcc\x81'
    #True
    #>>> truncateUTF8length(u'e\u0301', 0) == u"" # not in NFC (u'\xe9'), but in NFD
    #True
    #>>> truncateUTF8length(u'e\u0301', 1) == u"" #decodes to utf-8:
    #True
    #>>> truncateUTF8length(u'e\u0301', 2) == u""
    #True
    #>>> truncateUTF8length(u'e\u0301', 3) == u"e\u0301"
    #True
    """
    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
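The docstring mentions that combining characters could be handled with unicodedata.combining and a loop. A rough sketch of that idea, assuming Python 3 and a non-negative maxsize, might look like this. The function name truncate_utf8_keep_clusters is made up, and the check is only approximate: some marks have combining class 0 and would not be detected.

```python
import unicodedata


def truncate_utf8_keep_clusters(text, maxsize):
    """Truncate to at most maxsize UTF-8 bytes without orphaning a base char."""
    encoded = text.encode("utf-8")
    if len(encoded) <= maxsize:
        return text
    # The decoded result is always a prefix of text, and shorter than text here.
    result = encoded[:maxsize].decode("utf-8", errors="ignore")
    # If the first dropped char is a combining mark, the kept tail is incomplete:
    # remove trailing combining marks and then the base char they belong to.
    if unicodedata.combining(text[len(result)]):
        while result and unicodedata.combining(result[-1]):
            result = result[:-1]
        result = result[:-1]
    return result


print(repr(truncate_utf8_keep_clusters("e\u0301", 2)))
print(repr(truncate_utf8_keep_clusters("e\u0301", 3)))
```

With this variant the decomposed examples that are commented out in the docstring above would hold: a cut inside "e" plus combining acute removes the "e" as well instead of leaving it without its accent.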
Unicode and UTF-8 are nice, but if you don’t pay attention they will cause your code to contain a lot of sleeping bugs. And yes, I’d probably care less if there was no “ö” in my name.
PS: Günther, this is SFW. :p

