How to restrict the length of a unicode string

Ha, not with me!
It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize]: it limits the number of characters, not bytes, and the UTF-8 representation of a single code point can take up to 4 bytes (the original UTF-8 design allowed up to 6, but RFC 3629 capped it at 4).
So the next attempt could be unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"). This can cut the multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily Python will tell you by raising a UnicodeDecodeError.

The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" flag:

unicode(myUnicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
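
For readers on Python 3, where unicode() no longer exists and str is already a Unicode string, the same idea might look roughly like the sketch below. The function name truncate_utf8 is made up for illustration and is not part of the original code.

```python
def truncate_utf8(text, maxsize):
    """Return the longest prefix of text whose UTF-8 encoding fits in maxsize bytes."""
    # Slicing the bytes may cut a multi-byte sequence in half; errors="ignore"
    # silently drops that incomplete tail when decoding again.
    return text.encode("utf-8")[:maxsize].decode("utf-8", errors="ignore")


name = "jörn"
print(len(name), len(name.encode("utf-8")))  # 4 characters, 5 bytes

try:
    name.encode("utf-8")[:2].decode("utf-8")  # cuts the 2-byte "ö" in half
except UnicodeDecodeError as exc:
    print("strict decoding fails:", exc)

print(repr(truncate_utf8(name, 2)))  # 'j'
print(repr(truncate_utf8(name, 3)))  # 'jö'
```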

One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows combining characters, so for example the precomposed u"ü" (U+00FC) can also be represented by the decomposed sequence u"u" followed by the combining diaeresis U+0308 (see Unicode Normalization).
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so). This means that if a precomposed char exists, it will be used. Nevertheless, Unicode allows combining sequences that have no precomposed canonical equivalent. In that case not even normalizing helps: the sequence remains multiple unicode chars, each with its own (possibly multi-byte) UTF-8 representation, and the byte limit can fall between them. I’m unsure what the universal solution is here: if such a decomposed u"ü" gets split, is it better to keep the bare u"u" or nothing at all? You have to decide.
I decided to keep the "u" in the hopefully very rare case that this occurs.
So use the following with care:

def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
        string such that the corresponding utf-8 string won't exceed
        maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
        with the maxsize limit: either the whole char fits or it will be
        truncated completely. Make sure that unicodeStr is in Unicode
        Normalization Form C (NFC), else strange things can happen as
        mentioned in the examples below.
        Returns a unicode string, so if you need it encoded as utf-8, call
        .encode("utf-8") after calling this method.
        >>> truncateUTF8length(u"ö", 2) == u"ö"
        True
        >>> truncateUTF8length(u"ö", 1) == u""
        True
        >>> u'\u1ebf'.encode('utf-8') == '\xe1\xba\xbf'
        True
        >>> truncateUTF8length(u'hi\u1ebf', 2) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 3) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 4) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 5) == u"hi\u1ebf"
        True
       
        Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
        The following would not be true, as e and u'\u0301' would be separate
        unicode chars. This could be handled with unicodedata.combining
        and a loop deleting chars from the end until after the first non
        combining char, but this is _not_ done here!
        #>>> u'e\u0301'.encode('utf-8') == 'e\xcc\x81'
        #True
        #>>> truncateUTF8length(u'e\u0301', 0) == u"" # not in NFC (u'\xe9'), but in NFD
        #True
        #>>> truncateUTF8length(u'e\u0301', 1) == u"" #decodes to utf-8:
        #True
        #>>> truncateUTF8length(u'e\u0301', 2) == u""
        #True
        #>>> truncateUTF8length(u'e\u0301', 3) == u"e\u0301"
        #True
        """

    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
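
The docstring above hints that splitting a base character from its combining marks could be avoided with unicodedata.combining and a loop. Here is a minimal Python 3 sketch of that alternative. It is not what the function above does (that one keeps the bare "u"), and the name truncate_utf8_keep_combined is invented here. It only looks at characters with a non-zero canonical combining class, so it is not a full grapheme cluster segmentation.

```python
import unicodedata


def truncate_utf8_keep_combined(text, maxsize):
    """Truncate text to at most maxsize UTF-8 bytes without separating
    a base character from the combining marks that follow it."""
    encoded = text.encode("utf-8")
    if len(encoded) <= maxsize:
        return text

    # Byte-level cut; an incomplete multi-byte tail is dropped on decode.
    truncated = encoded[:maxsize].decode("utf-8", errors="ignore")

    # truncated is a proper prefix of text, so this index always exists.
    first_dropped = text[len(truncated)]
    if unicodedata.combining(first_dropped):
        # The cut fell inside a combining sequence: remove the marks that
        # made it into the prefix, then the base character they belong to.
        end = len(truncated)
        while end > 0 and unicodedata.combining(truncated[end - 1]):
            end -= 1
        if end > 0:
            end -= 1
        truncated = truncated[:end]
    return truncated


decomposed = "hie\u0301"  # "hi" + "e" + combining acute accent, 5 bytes in UTF-8
for limit in range(2, 6):
    print(limit, repr(truncate_utf8_keep_combined(decomposed, limit)))
```

With this variant the decomposed "é" either survives completely (limit 5) or disappears completely (limits 2 to 4), which is the "nothing" option from the discussion above.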

Unicode and UTF-8 are nice, but if you don’t pay attention they will leave a lot of dormant bugs in your code. And yes, probably I’d care less if there was no “ö” in my name ;)

PS: Günther, this is SFW. :p

(URL)Encoding in python

Well, encodings are a never ending story and whenever you don’t want to waste time on them, it’s for sure that you’ll stumble over yet another tripwire. This time it is the encoding of URLs (note: even though related I’m not talking about the urlencode function). Perhaps you have seen something like this before:
http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der which actually is the URI counterpart to this IRI: http://de.wikipedia.org/wiki/Gerhard_Schröder

Now what’s the problem, you might ask. The problem is that two things can happen here:
Either your browser (or the library you use) thinks: “hmm, this 'ö' is strange, let’s convert it into a '%C3%B6'” or your browser (or lib) doesn’t care and asks the server with the 'ö' in the URL, introducing a bit of non-determinism into your expectations, right?

More details here:

$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schröder
HTTP/1.0 200 OK
Date: Thu, 22 Jul 2010 09:41:56 GMT
...
Last-Modified: Wed, 21 Jul 2010 11:50:31 GMT
Content-Length: 144996
...
Connection: close
$ curl -I http://de.wikipedia.org/wiki/Gerhard_Schr%C3%B6der
HTTP/1.0 200 OK
Date: Sat, 31 Jul 2010 00:24:47 GMT
...
Last-Modified: Thu, 29 Jul 2010 10:04:31 GMT
Content-Length: 144962
...
Connection: close

Notice how the Date, Last-Modified and Content-Length differ.

OK, so how do we deal with this? I’d say: let’s always ask for the “percentified” version… but first, try to understand this:

# notice that my locale is en.UTF-8
>>> print "jörn"
jörn
>>> "jörn" # implicitly calls: print repr("jörn")
'j\xc3\xb6rn'
>>> print repr("jörn")
'j\xc3\xb6rn'
>>> u"jörn"
u'j\xf6rn'
>>> print u"jörn"
jörn
>>> print u"jörn".encode("utf8")
jörn
>>> u"jörn".encode("utf8")
'j\xc3\xb6rn'
>>> "jörn".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> 'j\xc3\xb6rn'.decode("utf8")
u'j\xf6rn'

So, what happened here?
As my locale is set to use UTF-8 encoding, all my inputs are utf-8 encoded already.
If you have wondered until now why 'ö' is translated into '%C3%B6', you might have spotted that 'ö' corresponds to the UTF-8 byte string "\xc3\xb6". "\xhh" is Python’s in-string escape sequence for a byte that is not printable ASCII, so this refers to 2 bytes with the hex code c3 b6 (binary: '11000011 10110110') (quite useful: "{0:b} {1:b}".format(int("c3", 16), int("b6",16))).
In URLs each "\xhh" is simply replaced by "%HH": a percent sign and two uppercase hex digits in ASCII. The unicode 'ö' (1 char, code point U+00F6, shown by Python as u"\xf6", binary '11110110') hence is first transformed into UTF-8 (1 char, 2 bytes: '11000011 10110110') by my OS before it enters Python, is internally kept in this form unless I use u"" strings, and is then represented in the URL as "%C3%B6" (6 chars, 6 bytes, ASCII).
What this example also shows is the implicit print repr(var) performed by the interactive Python interpreter when you simply enter some var and hit return.
print will try to convert strings to the encoding of the current locale if they’re Unicode strings (u""). Otherwise Python will not assume that the string has any specific encoding and just sticks with the bytes your OS delivered: it treats the string as it was received and writes the byte sequence to sys.stdout unchanged.
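
To make the chain "code point, UTF-8 bytes, percent escapes" concrete, here is a small sketch of the steps done by hand. It assumes Python 3, where str is Unicode and bytes are a separate type, so it is not a transcript of the Python 2 session above.

```python
from urllib.parse import quote

char = "ö"                          # one character, code point U+00F6
utf8_bytes = char.encode("utf-8")   # two bytes in UTF-8

# Each byte becomes "%" followed by two uppercase hex digits.
hex_pairs = ["{:02X}".format(b) for b in utf8_bytes]
bits = " ".join("{:08b}".format(b) for b in utf8_bytes)
percent_encoded = "".join("%" + h for h in hex_pairs)

print(hex(ord(char)))                  # 0xf6
print(utf8_bytes)                      # b'\xc3\xb6'
print(bits)                            # 11000011 10110110
print(percent_encoded)                 # %C3%B6
print(quote(char) == percent_encoded)  # True
```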

So back to the manual quoting of URLs:

>>> import urllib as ul
>>> ul.quote("jörn")
'j%C3%B6rn'
>>> print ul.quote("jörn")
j%C3%B6rn

>>> ul.unquote('j%C3%B6rn')
'j\xc3\xb6rn'
>>> ul.unquote("jörn")
'j\xc3\xb6rn'
>>> print ul.unquote("jörn")
jörn