Ha, not with me!
It’s a pretty common tripwire: imagine you have a unicode string and, for whatever reason (which should be a good one, so make sure you really need this), you have to guarantee that its UTF-8 representation is at most maxsize bytes long.
The first and in this case worst attempt is probably unicodeStr[:maxsize]: it limits the number of characters, but the UTF-8 representation can still be up to 4 times as long, as a single codepoint takes up to 4 bytes in UTF-8.
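You can see this in the interpreter: slicing to 4 chars still yields 8 UTF-8 bytes for a string of two-byte chars such as u"ö":

>>> len(u"öööö"[:4])
4
>>> len(u"öööö"[:4].encode("utf-8"))
8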
So the next, less bad attempt could be this: unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"). This limits the byte length, but it could cut the multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily Python will tell you by raising a UnicodeDecodeError.
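In the example, the truncated byte string 'j\xc3' ends right inside the two-byte sequence '\xc3\xb6' for the ö, so decoding blows up (the exact message depends on your Python version):

>>> unicode(u"jörn".encode("utf-8")[:2], "utf-8")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 1: unexpected end of data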
That attempt actually wasn’t that wrong; it only lacked the errors="ignore" flag:
unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
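Now the dangling lead byte is silently dropped, so a multi-byte char either fits completely or disappears:

>>> unicode(u"jörn".encode("utf-8")[:2], "utf-8", errors="ignore") == u"j"
True
>>> unicode(u"jörn".encode("utf-8")[:3], "utf-8", errors="ignore") == u"jö"
True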
One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows combining characters, so for example the precomposed u"ü" could also be represented by the decomposed sequence of u"u" followed by a combining diaeresis u'\u0308' (see Unicode Normalization).
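The unicodedata module converts between these forms; note how the UTF-8 byte length changes between them:

>>> import unicodedata
>>> unicodedata.normalize("NFC", u"u\u0308") == u"ü"
True
>>> len(u"ü".encode("utf-8")), len(unicodedata.normalize("NFD", u"ü").encode("utf-8"))
(2, 3)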
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC), at least the RDF Literal specs say so. This means that if a precomposed char exists, it will be used. Nevertheless, Unicode potentially allows combining characters which do not have a precomposed canonical equivalent. In that case not even normalizing would help: multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars, and I’m unsure what the universal solution is. For such a u"ü", is it better to keep a u"u" or nothing in case of a split? You have to decide.
I decided for keeping the u"u" in the hopefully very rare case this occurs.
So use the following with care:
def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
    string such that the corresponding utf-8 string won't exceed
    maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
    with the maxsize limit: either the whole char fits or it will be
    truncated completely. Make sure that unicodeStr is in Unicode
    Normalization Form C (NFC), else strange things can happen as
    mentioned in the examples below.
    Returns a unicode string, so if you need it encoded as utf-8, call
    .encode("utf-8") on the result.

    >>> truncateUTF8length(u"ö", 2) == u"ö"
    True
    >>> truncateUTF8length(u"ö", 1) == u""
    True
    >>> u'\u1ebf'.encode('utf-8') == '\xe1\xba\xbf'
    True
    >>> truncateUTF8length(u'hi\u1ebf', 2) == u"hi"
    True
    >>> truncateUTF8length(u'hi\u1ebf', 3) == u"hi"
    True
    >>> truncateUTF8length(u'hi\u1ebf', 4) == u"hi"
    True
    >>> truncateUTF8length(u'hi\u1ebf', 5) == u"hi\u1ebf"
    True

    Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...)).
    The following would not be true, as e and u'\u0301' would be separate
    unicode chars. This could be handled with unicodedata.combining
    and a loop deleting chars from the end until after the first non
    combining char, but this is _not_ done here!
    #>>> u'e\u0301'.encode('utf-8') == 'e\xcc\x81'
    #True
    #>>> truncateUTF8length(u'e\u0301', 0) == u""  # not in NFC (u'\xe9'), but in NFD
    #True
    #>>> truncateUTF8length(u'e\u0301', 1) == u""  # actually returns u"e"
    #True
    #>>> truncateUTF8length(u'e\u0301', 2) == u""  # actually returns u"e"
    #True
    #>>> truncateUTF8length(u'e\u0301', 3) == u"e\u0301"
    #True
    """
    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
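If you do need the stricter behaviour for decomposed strings, a variant along the lines hinted at in the docstring could look like the following. This is just a sketch, not part of the function above (the name truncateUTF8lengthGraphemeSafe is mine): whenever the cut lands inside a combining sequence, it drops the trailing combining chars together with their base char, so a grapheme either survives completely or not at all.

import unicodedata

def truncateUTF8lengthGraphemeSafe(unicodeStr, maxsize):
    # hypothetical variant of truncateUTF8length above
    truncated = unicode(unicodeStr.encode("utf-8")[:maxsize],
                        "utf-8", errors="ignore")
    # byte-level truncation + errors="ignore" always yields a codepoint
    # prefix of unicodeStr, so the first cut-off char (if any) tells us
    # whether we split a combining sequence
    if truncated != unicodeStr and unicodedata.combining(unicodeStr[len(truncated)]):
        # remove remaining combining chars and their base char from the end
        while truncated and unicodedata.combining(truncated[-1]):
            truncated = truncated[:-1]
        truncated = truncated[:-1]
    return truncated

With this, truncateUTF8lengthGraphemeSafe(u'e\u0301', 1) == u"" holds, i.e. the commented-out doctests above would pass, at the price of sometimes returning a shorter string than strictly necessary.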
Unicode and UTF-8 are nice, but if you don’t pay attention they will cause your code to contain a lot of sleeping bugs. And yes, I’d probably care less if there was no “ö” in my name 😉
PS: Günther, this is SFW. :p