How to restrict the length of a unicode string

Ha, not with me!

It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize], as its UTF-8 representation could be up to 4 times as long (a single code point takes between 1 and 4 bytes in UTF-8).
The next attempt, only slightly better, could be unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"). This can cut the multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily Python will tell you by raising a UnicodeDecodeError.
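
To make the two failure modes concrete, here is a minimal sketch. It assumes Python 3, where str is already a Unicode string and the u"" prefix and unicode() call from the post are not needed; the variable names are only illustrative.

```python
# Python 3 sketch: str is Unicode, bytes is the encoded form.
text = "jörn"

# Attempt 1: slicing by characters says nothing about the encoded size.
print(len(text[:2]))                  # number of characters kept
print(len(text[:2].encode("utf-8")))  # number of bytes: "ö" needs two

# Attempt 2: slicing the encoded bytes can split "ö" (0xC3 0xB6) in the middle.
cut = text.encode("utf-8")[:2]
try:
    cut.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode:", exc.reason)
```

The character slice keeps two characters but three bytes, and the byte slice ends on the lead byte of "ö", which is why decoding fails.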

The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" flag:

unicode(myUnicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
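
For readers on Python 3, the one-liner above might look roughly like the following sketch. The function name truncate_utf8 is hypothetical and not part of the original code; the idea is unchanged: encode, cut the bytes, and let errors="ignore" drop an incomplete trailing sequence.

```python
def truncate_utf8(text, maxsize):
    """Return a prefix of text whose UTF-8 encoding has at most maxsize bytes.

    Hypothetical Python 3 counterpart of the Python 2 one-liner; an
    incomplete multi-byte sequence at the cut is silently dropped.
    """
    return text.encode("utf-8")[:maxsize].decode("utf-8", errors="ignore")


print(truncate_utf8("jörn", 2))  # the cut falls inside "ö", so only "j" survives
print(truncate_utf8("jörn", 3))  # "ö" fits completely
```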

One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows combining characters, so for example the precomposed u"ü" could also be represented by the decomposed sequence u"u" followed by a combining diaeresis (U+0308) (see Unicode Normalization).
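
The difference between the two forms can be checked with the standard unicodedata module. This is a small Python 3 sketch; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00fc"      # precomposed "ü"
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

# Both spellings are canonically equivalent and normalize into each other.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Their UTF-8 sizes differ, so a byte limit can fall between "u" and its mark.
print(len(composed.encode("utf-8")))    # precomposed form
print(len(decomposed.encode("utf-8")))  # base letter plus combining mark
```

The precomposed form takes 2 bytes, the decomposed one 3 (1 for "u", 2 for U+0308), so truncating the decomposed form at 1 or 2 bytes leaves a bare "u".
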
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so). This means that if there is a precomposed char, it will be used. Nevertheless, Unicode potentially allows for combining character sequences which do not have a precomposed canonical equivalent. In this case not even normalizing would help: multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars. Here I’m unsure what the universal solution is. For such a u"ü", is it better to end up with a u"u" or with nothing when the split falls inside the sequence? You have to decide.
I decided to keep the u"u" in the hopefully very rare case this occurs.
So use the following with care:

def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
        string such that the corresponding utf-8 string won't exceed
        maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
        with the maxsize limit: either the whole char fits or it will be
        truncated completely. Make sure that unicodeStr is in Unicode
        Normalization Form C (NFC), else strange things can happen as
        mentioned in the examples below.
        Returns a unicode string, so if you need it encoded as utf-8, call
        .encode("utf-8") after calling this method.
        >>> truncateUTF8length(u"ö", 2) == u"ö"
        True
        >>> truncateUTF8length(u"ö", 1) == u""
        True
        >>> u'\u1ebf'.encode('utf-8') == '\xe1\xba\xbf'
        True
        >>> truncateUTF8length(u'hi\u1ebf', 2) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 3) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 4) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 5) == u"hi\u1ebf"
        True

        Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
        The following would not be true, as e and u'\u0301' would be separate
        unicode chars. This could be handled with unicodedata.combining
        and a loop deleting chars from the end until after the first non
        combining char, but this is _not_ done here!
        #>>> u'e\u0301'.encode('utf-8') == 'e\xcc\x81'
        #True
        #>>> truncateUTF8length(u'e\u0301', 0) == u"" # not in NFC (u'\xe9'), but in NFD
        #True
        #>>> truncateUTF8length(u'e\u0301', 1) == u"" # actually returns u"e"
        #True
        #>>> truncateUTF8length(u'e\u0301', 2) == u""
        #True
        #>>> truncateUTF8length(u'e\u0301', 3) == u"e\u0301"
        #True
        """
    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
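
The docstring mentions that decomposed input could be handled with unicodedata.combining and a loop that deletes chars from the end. The following is only a sketch of that idea, assuming Python 3; the function name is hypothetical and this is not the code used above. Note that it makes the opposite choice to the one described in the post: if the cut falls inside a combining sequence, the whole sequence is dropped instead of keeping the bare base char. It also relies on unicodedata.combining returning a non-zero class, which does not cover every kind of mark or multi-codepoint cluster.

```python
import unicodedata


def truncate_utf8_whole_sequences(text, maxsize):
    """Truncate text to at most maxsize UTF-8 bytes without splitting
    a base char from its combining marks (hypothetical sketch)."""
    truncated = text.encode("utf-8")[:maxsize].decode("utf-8", errors="ignore")
    if len(truncated) == len(text):
        return truncated  # nothing was cut off

    # truncated is a prefix of text, so text[len(truncated)] is the first
    # dropped char. If it is a combining mark, the cut fell inside a
    # combining sequence and the kept part of that sequence must go too.
    if unicodedata.combining(text[len(truncated)]):
        end = len(truncated)
        # Skip combining marks that were already kept ...
        while end > 0 and unicodedata.combining(truncated[end - 1]):
            end -= 1
        # ... and then the base char they belong to.
        truncated = truncated[:max(end - 1, 0)]
    return truncated


print(repr(truncate_utf8_whole_sequences("e\u0301", 2)))  # sequence does not fit
print(repr(truncate_utf8_whole_sequences("e\u0301", 3)))  # sequence fits
```

With this variant the commented-out examples from the docstring would hold: limits of 0, 1 and 2 bytes yield an empty string, and 3 bytes keep the full decomposed "é".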

Unicode and UTF-8 are nice, but if you don’t pay attention they will leave a lot of sleeping bugs in your code. And yes, I’d probably care less if there was no “ö” in my name 😉

PS: Günther, this is SFW. :p

6 thoughts on “How to restrict the length of a unicode string”

Raphael 2010-12-14 at 19:07

This should not be a problem at all.

PHP: substr("äöüæ€", 2, 4) Gives ‘öü’ instead of ‘öüæ€’. Same problem as Python.

Ruby: "äöüßæ€"[2..4] Also same problem.

Haskell: take 3 (drop 2 "äöüß€æ") A bit unwieldy, but gives the correct substring. Control codes in output, though (in ghci).

Java/Scala: "äöüß€æ".substring(2,4) Gives ‘üß’, the desired result.

Funny how issues like this tend to confirm my preference for languages.


joern (Post author) 2010-12-14 at 19:17

That was not my point: in Python this also works when using unicode strings (which is implicit in Java, Scala and Haskell, I guess, and as it will be in Python 3.0):

>>> u"öüß€"[2:4]
u'\xdf\u20ac'
>>> print u"öüß€"[2:4]
ß€

Nevertheless, the problem was: how do you truncate a unicode string so that the corresponding _UTF-8_ representation will only have maxsize bytes?


Günther 2010-12-17 at 22:56

Nice. 🙂


Nirmal 2011-02-10 at 03:09

Reached here looking for a PHP solution. For others –

PHP:


mb_strcut($myUnicodeStr, $maxBytes);


Angelo 2012-07-12 at 13:50

My Ruby version:

def truncateUTF8(unicode_string, maxsize)
  return unicode_string.bytes.to_a[0..maxsize-1].pack('c*').force_encoding('UTF-8').encode("UTF-16BE", :invalid => :replace, :replace => "").encode("UTF-8")
end


Peter Vandenabeele (@peter_v) 2013-07-08 at 12:33

Thanks for this. However, this ruby implementation works on MRI ruby, but fails on JRuby. Details and work-around here: https://github.com/jruby/jruby/issues/861


Jörn's Blog

Science, code and stuff…
