How to restrict the length of a unicode string

Ha, not with me!

It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize], as its UTF-8 representation could be up to 6 times as long.
So the next worse attempt could be this unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"): This could cut a multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily python will tell you by throwing a UnicodeDecodeError.

The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" flag:

unicode(myUnicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")

One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows Combined Characters, for example the precomposed u"ü" could be represented by the decomposed sequence u"u" and u"¨" (see Unicode Normalization).
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so. This means that if there is a precomposed char it will be used. Nevertheless Unicode potentially allows for Combined characters which do not have a precomposed canonical equivalent. In this case not even normalizing would help, multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars. In this case I’m unsure what’s the universal solution… for such a u”ü” is it better to have a u”u” or nothing in case of a split? You have to decide.
I decided for having an “u” in the hopefully very rare case this occurs.
So use the following with care:

def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
        string such that the corresponding utf-8 string won't exceed
        maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
        with the maxsize limit: either the whole char fits or it will be
        truncated completely. Make sure that unicodeStr is in Unicode
        Normalization Form C (NFC), else strange things can happen as
        mentioned in the examples below.
        Returns a unicode string, so if you need it encoded as utf-8, call
        .decode("utf-8") after calling this method.
        >>> truncateUTF8lengthIfNecessary(u"ö", 2) == (u"ö", False)
        True
        >>> truncateUTF8length(u"ö", 1) == u""
        True
        >>> u'u1ebf'.encode('utf-8') == 'xe1xbaxbf'
        True
        >>> truncateUTF8length(u'hiu1ebf', 2) == u"hi"
        True
        >>> truncateUTF8lengthIfNecessary(u'hiu1ebf', 3) == (u"hi", True)
        True
        >>> truncateUTF8length(u'hiu1ebf', 4) == u"hi"
        True
        >>> truncateUTF8length(u'hiu1ebf', 5) == u"hiu1ebf"
        True

        Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
        The following would not be true, as e and u'u0301' would be seperate
        unicode chars. This could be handled with unicodedata.combining
        and a loop deleting chars from the end until after the first non
        combining char, but this is _not_ done here!
        #>>> u'eu0301'.encode('utf-8') == 'exccx81'
        #True
        #>>> truncateUTF8length(u'eu0301', 0) == u"" # not in NFC (u'xe9'), but in NFD
        #True
        #>>> truncateUTF8length(u'eu0301', 1) == u"" #decodes to utf-8: 
        #True
        #>>> truncateUTF8length(u'eu0301', 2) == u""
        #True
        #>>> truncateUTF8length(u'eu0301', 3) == u"eu0301"
        #True
        """
    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")

Unicode and UTF-8 is nice, but if you don’t pay attention it will cause your code to contain a lot of sleeping bugs. And yes, probably I’d care less if there was no “ö” in my name 😉

PS: Günther, this is SFW. :p

6 thoughts on “How to restrict the length of a unicode string

  1. Raphael

    This should not be a problem at all.

    PHP: substr("äöüæ€", 2, 4) Gives ‘öü’ instead of ‘öü怒. Same problem as Python.

    Ruby: "äöüßæ€"[2..4] Also same problem.

    Haskell: take 3 (drop 2 "äöü߀æ") A bit unwieldy, but gives the correct substring. Control codes in output, though (in ghci).

    Java/Scala: "äöü߀æ".substring(2,4) Gives ‘üß’, the desired result.

    Funny how issues like this tend to confirm my preference for languages.

    Reply
    1. joern Post author

      That was not my point: In python this also works when using unicode strings (which is implicit in Java, Scala, Haskell I guess) (and as it will be in Python 3.0):
      [cc_python]
      >>> u”öü߀”[2:4]
      u’xdfu20ac’
      >>> print u”öü߀”[2:4]
      ߀
      [/cc_python]

      Nevertheless, the problem was: how do you truncate a unicode string so that the corresponding _UTF-8_ representation will only have maxsize bytes?

      Reply
  2. Nirmal

    Reached here looking for a PHP solution. For others –

    PHP:

    mb_strcut($myUnicodeStr, $maxBytes);

    Reply
  3. Angelo

    My Ruby version:


    def truncateUTF8(unicode_string, maxsize)
    return unicode_string.bytes.to_a[0..maxsize-1].pack('c*').force_encoding('UTF-8').encode("UTF-16BE", :invalid => :replace, :replace =>"").encode("UTF-8")
    end

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

 

This site uses Akismet to reduce spam. Learn how your comment data is processed.