Ha, not with me!
It’s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes.
The first and in this case worst attempt is probably unicodeStr[:maxsize], as its UTF-8 representation could be up to 4 times as long (UTF-8 uses 1 to 4 bytes per code point).
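To see why slicing by characters gives no guarantee about bytes, here is a small check. It is written for Python 3, where `str` is what this post calls a unicode string, and the sample string is made up for illustration.

```python
# Python 3 sketch: slicing by characters does not bound the UTF-8 byte length.
text = "€€€€"        # illustrative sample; each euro sign takes 3 bytes in UTF-8
maxsize = 2

sliced = text[:maxsize]                     # keeps 2 characters ...
encoded_len = len(sliced.encode("utf-8"))   # ... which encode to 6 bytes

print(len(sliced), encoded_len)
```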
So the next, slightly better attempt could be this: unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8"). This could cut a multi-byte UTF-8 representation of a codepoint in half (example: unicode(u"jörn".encode("utf-8")[:2], "utf-8")). Luckily Python will tell you by raising a UnicodeDecodeError.
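The failure can be reproduced in a few lines. This is a Python 3 sketch of the same example, using `bytes.decode` in place of the Python 2 `unicode(...)` call.

```python
# Python 3 sketch: cutting the encoded bytes can split a multi-byte sequence.
data = "jörn".encode("utf-8")   # b'j\xc3\xb6rn'
cut = data[:2]                  # b'j\xc3': the two-byte "ö" is cut in half

try:
    cut.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print("decode failed:", exc.reason)
```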
The last attempt actually wasn’t that wrong, as it only lacked the errors="ignore" argument:
unicode(myUnicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
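For readers on Python 3, a rough equivalent of the line above might look like the following. The function name `truncate_utf8` is made up for this sketch and does not appear in the original post.

```python
# Python 3 sketch; truncate_utf8 is a hypothetical name for illustration.
def truncate_utf8(text, maxsize):
    """Cut the UTF-8 bytes at maxsize and drop any incomplete trailing sequence."""
    return text.encode("utf-8")[:maxsize].decode("utf-8", errors="ignore")


print(truncate_utf8("jörn", 2))   # the half-cut "ö" is dropped
print(truncate_utf8("jörn", 3))   # now the whole "ö" fits
```

Because the input is valid UTF-8 to begin with, the only thing `errors="ignore"` can drop here is the incomplete sequence at the cut.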
One might think we’re done now, but this depends on your Unicode Normalization Form: Unicode allows combining characters, for example the precomposed u"ü" could be represented by the decomposed sequence u"u" followed by the combining diaeresis u"\u0308" (see Unicode Normalization).
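The two spellings of "ü" can be compared with the standard `unicodedata` module. This is a Python 3 sketch; the variable names are only illustrative.

```python
import unicodedata

# "u" followed by U+0308 COMBINING DIAERESIS (the decomposed, NFD spelling)
decomposed = "u\u0308"
# NFC merges the pair into the single precomposed code point U+00FC
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(decomposed.encode("utf-8")))   # 2 code points, 3 bytes
print(len(composed), len(composed.encode("utf-8")))       # 1 code point, 2 bytes
```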
In my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the RDF Literal Specs say so). This means that if there is a precomposed char it will be used. Nevertheless Unicode potentially allows for combining character sequences which do not have a precomposed canonical equivalent. In this case not even normalizing would help: multiple unicode chars would remain, each with its own UTF-8 byte sequence. In this case I’m unsure what the universal solution is… for such a u"ü", is it better to have a u"u" or nothing in case of a split? You have to decide.
I decided to keep the “u” in the hopefully very rare case this occurs.
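As an illustration of that rare case, "q" with a diaeresis is used below on the assumption that Unicode has no precomposed form for it, so NFC leaves two code points. This is a Python 3 sketch with a sample string chosen only for the example.

```python
import unicodedata

# "q" + U+0308 COMBINING DIAERESIS; assumed to have no precomposed equivalent
text = unicodedata.normalize("NFC", "q\u0308")

# Cutting at 2 bytes splits the 2-byte combining mark; errors="ignore" drops it,
# so the base letter survives without its diaeresis.
cut = text.encode("utf-8")[:2].decode("utf-8", errors="ignore")

print(len(text), repr(cut))
```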
So use the following with care:
def truncateUTF8length(unicodeStr, maxsize):
    ur""" This method can be used to truncate the length of a given unicode
        string such that the corresponding utf-8 string won't exceed
        maxsize bytes. It will take care of multi-byte utf-8 chars intersecting
        with the maxsize limit: either the whole char fits or it will be
        truncated completely. Make sure that unicodeStr is in Unicode
        Normalization Form C (NFC), else strange things can happen as
        mentioned in the examples below.
        Returns a unicode string, so if you need it encoded as utf-8, call
        .encode("utf-8") after calling this method.
        >>> truncateUTF8lengthIfNecessary(u"ö", 2) == (u"ö", False)
        True
        >>> truncateUTF8length(u"ö", 1) == u""
        True
        >>> u'\u1ebf'.encode('utf-8') == '\xe1\xba\xbf'
        True
        >>> truncateUTF8length(u'hi\u1ebf', 2) == u"hi"
        True
        >>> truncateUTF8lengthIfNecessary(u'hi\u1ebf', 3) == (u"hi", True)
        True
        >>> truncateUTF8length(u'hi\u1ebf', 4) == u"hi"
        True
        >>> truncateUTF8length(u'hi\u1ebf', 5) == u"hi\u1ebf"
        True

        Make sure the unicodeStr is in NFC (see unicodedata.normalize("NFC", ...) ).
        The following would not be true, as e and u'\u0301' would be separate
        unicode chars. This could be handled with unicodedata.combining
        and a loop deleting chars from the end until after the first non
        combining char, but this is _not_ done here!
        #>>> u'e\u0301'.encode('utf-8') == 'e\xcc\x81'
        #True
        #>>> truncateUTF8length(u'e\u0301', 0) == u"" # not in NFC (u'\xe9'), but in NFD
        #True
        #>>> truncateUTF8length(u'e\u0301', 1) == u"" #decodes to utf-8:
        #True
        #>>> truncateUTF8length(u'e\u0301', 2) == u""
        #True
        #>>> truncateUTF8length(u'e\u0301', 3) == u"e\u0301"
        #True
    """
    return unicode(unicodeStr.encode("utf-8")[:maxsize], "utf-8", errors="ignore")
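The docstring mentions that split combining sequences could be handled with `unicodedata.combining` and a loop, which the function above deliberately does not do. A minimal Python 3 sketch of that alternative follows; the name `truncate_utf8_whole_sequences` is hypothetical, and the check relies only on the canonical combining class, so marks with class 0 would not be detected.

```python
import unicodedata


# Hypothetical helper, not part of the original post.
def truncate_utf8_whole_sequences(text, maxsize):
    """Truncate to maxsize UTF-8 bytes, dropping a base char whose marks were cut."""
    truncated = text.encode("utf-8")[:maxsize].decode("utf-8", errors="ignore")
    # If the first dropped character is a combining mark, the cut fell
    # inside a base-plus-marks sequence.
    if len(truncated) < len(text) and unicodedata.combining(text[len(truncated)]):
        end = len(truncated)
        # Delete trailing combining marks ...
        while end > 0 and unicodedata.combining(truncated[end - 1]):
            end -= 1
        # ... and then the base character they belonged to.
        if end > 0:
            end -= 1
        truncated = truncated[:end]
    return truncated


print(repr(truncate_utf8_whole_sequences("e\u0301", 2)))
print(repr(truncate_utf8_whole_sequences("e\u0301", 3)))
```

With this variant a split sequence disappears entirely, which is the opposite of the choice made above; which behaviour is preferable depends on the application.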
Unicode and UTF-8 are nice, but if you don’t pay attention they will leave a lot of sleeping bugs in your code. And yes, probably I’d care less if there was no “ö” in my name 😉
PS: Günther, this is SFW. :p
This should not be a problem at all.
PHP:
substr("äöüæ€", 2, 4)
Gives ‘öü’ instead of ‘öü怒. Same problem as Python.
Ruby:
"äöüßæ€"[2..4]
Also same problem.
Haskell:
take 3 (drop 2 "äöü߀æ")
A bit unwieldy, but gives the correct substring. Control codes in output, though (in ghci).
Java/Scala:
"äöü߀æ".substring(2,4)
Gives ‘üß’, the desired result.
Funny how issues like this tend to confirm my preference for languages.
That was not my point: In python this also works when using unicode strings (which is implicit in Java, Scala, Haskell I guess) (and as it will be in Python 3.0):
[cc_python]
>>> u"öü߀"[2:4]
u'\xdf\u20ac'
>>> print u"öü߀"[2:4]
߀
[/cc_python]
Nevertheless, the problem was: how do you truncate a unicode string so that the corresponding _UTF-8_ representation will only have maxsize bytes?
Nice. 🙂
Reached here looking for a PHP solution. For others –
PHP:
mb_strcut($myUnicodeStr, 0, $maxBytes, "UTF-8");
My Ruby version:
def truncateUTF8(unicode_string, maxsize)
return unicode_string.bytes.to_a[0..maxsize-1].pack('c*').force_encoding('UTF-8').encode("UTF-16BE", :invalid => :replace, :replace =>"").encode("UTF-8")
end
Thanks for this. However, this ruby implementation works on MRI ruby, but fails on JRuby. Details and work-around here: https://github.com/jruby/jruby/issues/861