Tag Archives: encoding

Python unicode doctest howto in a doctest

Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest.
So I thought the ultimate way to show how to do this would be in a doctest ;)

# -*- coding: utf-8 -*-

def testDocTestUnicode():
    ur"""Non ascii letters in doctests actually are tricky. The reason why
        things work here that usually don't (each marked with a #BAD!) is
        explained quite in the end of this doctest, but the essence is: we
        didn't only fix the encoding of this file, but also the
        sys.defaultencoding, which you should never do.
        This file has a utf8 input encoding, which python is informed about by
        the first line: # -*- coding: utf-8 -*-. This means that for example an
        ä is 2 bytes: 11000011 10100100 (hexval "c3a4").
        There are two types of strings in Python 2.x: "" aka byte strings and
        u"" aka unicode string. For these two types two different things happen
        when parsing a file:
        If python encounters a non ascii char in a byte string (e.g., "ä") it
        will check if there's an input encoding given (yes, utf8) and then check
        if the 2 bytes ä is a valid utf-8 encoded char (yes it is). It will then
        simply keep the ä as its 2 byte utf-8 encoding in this byte-string
        internal representation. If you print it and you're lucky to have a utf8
        console you'll see an ä again. If you're not lucky and for example have
        a iso-8859-15 encoding on your console you'll see 2 strange chars
        (probably À) instead. So python will simply write the byte-string to
        >>> print "ä" #BAD!
        If there was no encoding given, we'd get a SyntaxError: Non-ASCII
        character 'xc3' in file ..., which is the first byte of our 2 byte ä.
        Where did the 'xc3' come from? Well, this is python's way of writing a
        non ascii byte to ascii output (which is always safe, so perfect for
        this error message): it will write a x and then two hex chars for each
        byte. Python does the same if we call:
        >>> print repr("ä")
        Or just
        >>> "ä"
        It also works the other way around, so you can give an arbitrary byte by
        using the same xXX escape sequences:
        >>> print "xc3xa4" #BAD!
        Oh look, we hit the utf8 representation of an ä, what a luck. You'll ask
        how do I then print "xc3xa4" to my console? You can either double all
        "" or tell python it's a raw string:
        >>> print "\xc3\xa4"
        >>> print r"xc3xa4"
        If python encounters a unicode string in our document (e.g., u"ä") it
        will use the specified file encoding to convert our 2 byte utf8 ä into a
        unicode string. This is the same as calling "ä".decode(myFileEncoding):
        >>> print u"ä" # BAD for another reason!
        >>> u"ä"
        >>> "ä".decode("utf-8")
        Python's internal unicode representation of this string is never exposed
        to the user (it could be UTF-16 or 32 or anything else, anyone?).
        The hex e4 corresponds to 11100100, the unicode ord value of the char ä,
        which is decimal 228.
        >>> ord(u'ä')
        And the same again backwards, we can use the xXX escaping to denote a
        hex unicode point or raw not to interpret such escaping:
        >>> print u"xe4"
        >>> print ur"xe4"
        Oh, noticed the difference? This time print did some magic. I told
        you, you'll never see python's internal representation of a unicode
        string. So whenever print receives a unicode string it will try to
        convert it to your output encoding (sys.out.encoding), which works in a
        terminal, but won't work if you're for example redirecting output to a
        file. In such cases you have to convert the string into the desired
        encoding explicitly:
        >>> u"ä".encode("utf8")
        >>> print u"ä".encode("utf8") #BAD!
        If that last line confused you a bit: We converted the unicode string
        to a byte-string, which was then simply copied byte-wise by print and
        voila, we got an ä.
        This all is done before the string even reaches doctest.
        So you might have written something like all the above in doctests,
        and probably saw them failing. In most cases you probably just
        forgot the ur'''prefix''', but sometimes you had it and were confused.
        Well this is good, as all of the above #BAD! examples don't make much sense.
        Bummer, right.
        The reason is: we made assumptions on the default encoding all over the
        place, which is not a thing you would ever want to do in production
        code. We did this by setting sys.setdefaultencoding("UTF-8")
        below. Without this you'll usually get unicode warnings like this one:
        "UnicodeWarning: Unicode equal comparison failed to convert both
        arguments to Unicode - interpreting them as being unequal".
        Just fire up a python interpreter (not pydev, as I noticed it seems to
        fiddle with the default setting).
        Try: u"ä" == "ä"
        You should get:
            __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both
                arguments to Unicode - interpreting them as being unequal
        This actually is very good, as it warns you that you're comparing some
        byte-string from whatever location (could be a file) to a unicode string.
        Shall python guess the encoding? Silently? Probably a bad idea.
        Now if you do the following in your python interpreter:
            import sys
            u"ä" == "ä"
        You should get:
        No wonder, you explicitly told python to interpret the "ä" as utf8
        encoded when nothing else specified.
        So what's the problem in our docstrings again? We had these bad
        >>> print "ä" #BAD!
        >>> print "xc3xa4" #BAD!
        >>> print u"ä".encode("utf8") #BAD!
        Well, we're in a ur'''docstring''' here, so what doctest does is: it
        takes the part after >>> and exec(utes) it. There's one special feature
        of exec i wasn't aware of: if you pass a unicode string to it, it will
        revert the char back to utf-8:
        >>> exec u'print repr("ä")'
        >>> exec u'print repr("xe4")'
        This means that even though one might think that print "ä" in this
        unicode docstring will get print "xe4", it will print as if you wrote
        print "ä" outside of a unicode string, so as if you wrote print
        "xc3xa4". Let this twist your mind for a second. The doctest will
        execute as if there had been no conversion to a unicode string, which is
        what you want. But now comes the comparison. It will see what comes out
        of that and compare to the next line from this docstring, which now is a
        unicode "ä", so xe4. Hence we're now comparing u'xe4' == 'xc3xa4'.
        If you didn't notice, this is the same we did in the python interpreter
        above: we were comparing u"ä" == "ä". And again python tells us "Hmm,
        don't know shall I guess how to convert "ä" to u"ä"? Probably not, so
        evaluate to False.
        Always specify the source encoding: # -*- coding: utf-8 -*-
        and _ALWAYS_, no excuse, use utf-8. Repeat it: I will never use
        iso-8859-x, latin-1 or anything else, I'll use UTF-8 so I can write
        Jörn and he can actually read his name once.
        Use ur'''...''' surrounded docstrings (so a raw unicode docstring).
        You can also use ru'''...''', but I always think Russian strings?
        Never compare a unicode string with a byte string. This means: don't
        use u"ä" and "ä" mixed, they're not the same. Also the result line can
        only match unicode strings plain ascii, no other encoding.
        The following are bad comparisons, as they will compare byte- and
        unicode strings. They'll cause warnings and eval to false:
        #>>> u"ä" == "ä"
        #>>> "ä".decode("utf8") == "ä"
        #>>> print "ä"
        So finally a few working examples:  
        >>> "ä" # if file encoding is utf8
        >>> u"ä"
        Here both are unicode, so no problem, but nevertheless a bad idea to
        match output of print due to the print magic mentioned above and think
        about i18n: time formats, commas, dots, float precision, etc.
        >>> print u"ä" # unicode even after exec, no prob.
        >>> "ä" == "ä" # compares byte-strings
        >>> u"ä".encode("utf8") == "ä" # compares byte-strings
        >>> u"ä" == u"ä" # compares unicode-strings
        >>> "ä".decode("utf8") == u"ä" # compares unicode-strings


if __name__ == "__main__":
    import sys
    sys.setdefaultencoding("UTF-8") # DON'T DO THIS. READ THE ABOVE @UndefinedVariable
    import doctest

How to restrict the length of a unicode string

Ha, not with me!
It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize], as its UTF-8 representation could be up to 6 times as long.
So the next worse attempt could be this unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"): This could cut a multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily python will tell you by throwing a UnicodeDecodeError.

The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" flag:

unicode(myUnicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")

One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows Combined Characters, for example the precomposed u"ü" could be represented by the decomposed sequence u"u" and u"¨" (see Unicode Normalization).
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so. This means that if there is a precomposed char it will be used. Nevertheless Unicode potentially allows for Combined characters which do not have a precomposed canonical equivalent. In this case not even normalizing would help, multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars. In this case I’m unsure what’s the universal solution… for such a u”ü” is it better to have a u”u” or nothing in case of a split? You have to decide.
I decided for having an “u” in the hopefully very rare case this occurs.
So use the following with care:

def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
        string such that the corresponding utf-8 string won't exceed
        maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
        with the maxsize limit: either the whole char fits or it will be
        truncated completely. Make sure that unicodeStr is in Unicode
        Normalization Form C (NFC), else strange things can happen as
        mentioned in the examples below.
        Returns a unicode string, so if you need it encoded as utf-8, call
        .decode("utf-8") after calling this method.
        >>> truncateUTF8lengthIfNecessary(u"ö", 2) == (u"ö", False)
        >>> truncateUTF8length(u"ö", 1) == u""
        >>> u'u1ebf'.encode('utf-8') == 'xe1xbaxbf'
        >>> truncateUTF8length(u'hiu1ebf', 2) == u"hi"
        >>> truncateUTF8lengthIfNecessary(u'hiu1ebf', 3) == (u"hi", True)
        >>> truncateUTF8length(u'hiu1ebf', 4) == u"hi"
        >>> truncateUTF8length(u'hiu1ebf', 5) == u"hiu1ebf"
        Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
        The following would not be true, as e and u'u0301' would be seperate
        unicode chars. This could be handled with unicodedata.combining
        and a loop deleting chars from the end until after the first non
        combining char, but this is _not_ done here!
        #>>> u'eu0301'.encode('utf-8') == 'exccx81'
        #>>> truncateUTF8length(u'eu0301', 0) == u"" # not in NFC (u'xe9'), but in NFD
        #>>> truncateUTF8length(u'eu0301', 1) == u"" #decodes to utf-8:
        #>>> truncateUTF8length(u'eu0301', 2) == u""
        #>>> truncateUTF8length(u'eu0301', 3) == u"eu0301"

    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")

Unicode and UTF-8 is nice, but if you don’t pay attention it will cause your code to contain a lot of sleeping bugs. And yes, probably I’d care less if there was no “ö” in my name ;)

PS: Günther, this is SFW. :p

How to convert hex strings to binary ascii strings in python (incl. 8bit space)

As this comes across may way again and again:

How do you turn a hex string like "c3a4c3b6c3bc" into a nice binary string like this: "11000011 10100100 11000011 10110110 11000011 10111100"?

The solution is based on the Python 2.6 new string formatting:

>>> "{0:8b}".format(int("c3",16))

Which can be decomposed into 4 bit for each hex char like this: (notice the 04b, which means 0-padded 4chars long binary string):

>>> "{0:04b}".format(int("c",16)) + "{0:04b}".format(int("3",16))

OK, now we could easily do this for all hex chars "".join(["{0:04b}".format(int(c,16)) for c in "c3a4c3b6"]) and done, but usually we want a blank every 8 bits from the right to left… And looping from the right pairwise is a bit more complicated… Oh and what if the number of bits is uneven?
So the solution looks like this:

>>> binary = lambda x: " ".join(reversed( [i+j for i,j in zip( *[ ["{0:04b}".format(int(c,16)) for c in reversed("0"+x)][n::2] for n in [1,0] ] ) ] ))
>>> binary("c3a4c3b6c3bc")
'11000011 10100100 11000011 10110110 11000011 10111100'

It takes the hex string x, first of all concatenates a "0" to the left (for the uneven case), then reverses the string, converts every char into a 4-bit binary string, then collects all uneven indices of this list, zips them to all even indices, for each in the pairs-list concatenates them to 8-bit binary strings, reverses again and joins them together with a ” ” in between. In case of an even number the added 0 falls out, because there’s no one to zip with, if uneven it zips with the first hex-char.

Yupp, I like 1liners ;)

(URL)Encoding in python

Well, encodings are a never ending story and whenever you don’t want to waste time on them, it’s for sure that you’ll stumble over yet another tripwire. This time it is the encoding of URLs (note: even though related I’m not talking about the urlencode function). Perhaps you have seen something like this before:
http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der which actually is the URI pendant to this IRI: http://de.wikipedia.org/wiki/Gehard_Schröder

Now what’s the problem, you might ask. The problem is that two things can happen here:
Either your browser (or the library you use) thinks: “hmm, this 'ö' is strange, let’s convert it into a '%C3%B6'” or your browser (or lib) doesn’t care and asks the server with the 'ö' in the URL, introducing a bit of non-determinism into your expectations, right?

More details here:

$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schröder
HTTP/1.0 200 OK
Date: Thu, 22 Jul 2010 09:41:56 GMT
Last-Modified: Wed, 21 Jul 2010 11:50:31 GMT
Content-Length: 144996
Connection: close
$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der
HTTP/1.0 200 OK
Date: Sat, 31 Jul 2010 00:24:47 GMT
Last-Modified: Thu, 29 Jul 2010 10:04:31 GMT
Content-Length: 144962
Connection: close

Notice how the Date, Last-Modified and Content-Length differ.

OK, so how do we deal with this? I’d say: let’s always ask for the “percentified” version… but before try to understand this:

# notice that my locale is en.UTF-8
>>> print "jörn"
>>> "jörn" # implicitly calls: print repr("jörn")
>>> print repr("jörn")
>>> u"jörn"
>>> print u"jörn"
>>> print u"jörn".encode("utf8")
>>> u"jörn".encode("utf8")
>>> "jörn".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

So, what happened here?
As my locale is set to use UTF-8 encoding, all my inputs are utf-8 encoded already.
If until now you might have wondered, why 'ö' is translated into '%C3%B6', you might have spotted that 'ö' corresponds to the utf-8 "xc3xb6", which actually is python’s in string escape sequence for non-ASCII chars: it refers to 2 bytes with the hex-code: c3b6 (binary: '11000011 10110110') (quite useful: "{0:b} {1:b}".format(int("c3", 16), int("b6",16))).
So in URLs these "xhh" are simply replaced by "%HH", so a percent and two uppercase ASCII-Chars indicating a hex-code. The unicode 'ö' (1 char, 1byte, unicode "xf6" ('11110110')) hence is first transformed into utf-8 (1char, 2byte, utf8: '11000011 10110110') by my OS, before entering it into python, internally kept in this form unless I use the u"" strings, and then represented in the URL with "%C3%B6" (6chars, 6byte, ASCII).
What this example also shows is the implicit print repr(var) performed by the interactive python interpreter when you simply enter some var and hit return.
Print will try to convert strings to the current locale if they’re Unicode-Strings (u""). Else python will not assume that the string has any specific encoding, but just stick with the encoding your OS chose. It will simply treat the string as it was received and write the byte-sequence to your sys.stdout.

So back to the manual quoting of URLs:

>>> import urllib as ul
>>> ul.quote("jörn")
>>> print ul.quote("jörn")

>>> ul.unquote('j%C3%B6rn')
>>> ul.unquote("jörn")
>>> print ul.unquote("jörn")