{"id":297,"date":"2010-12-14T16:21:06","date_gmt":"2010-12-14T15:21:06","guid":{"rendered":"http:\/\/joernhees.de\/blog\/?p=297"},"modified":"2016-09-28T23:58:46","modified_gmt":"2016-09-28T21:58:46","slug":"how-to-restrict-the-length-of-a-unicode-string","status":"publish","type":"post","link":"https:\/\/joernhees.de\/blog\/2010\/12\/14\/how-to-restrict-the-length-of-a-unicode-string\/","title":{"rendered":"How to restrict the length of a unicode string"},"content":{"rendered":"<p>Ha, not with me!<\/p>\n<p>It&#8217;s a pretty common tripwire: Imagine you have a unicode string and, for whatever reason (which should be a good reason, so make sure you really need this), you need to make sure that its UTF-8 representation has at most maxsize bytes.<br \/>\nThe first and in this case worst attempt is probably <code>unicodeStr[:maxsize]<\/code>, as its <a href=\"http:\/\/en.wikipedia.org\/wiki\/UTF-8\">UTF-8<\/a> representation could be up to 4 times as long (UTF-8 needs between one and four bytes per code point).<br \/>\nThe next, still flawed attempt could be <code>unicode(unicodeStr.encode(\"utf-8\")[:maxsize], \"utf-8\")<\/code>: This could cut the multi-byte UTF-8 representation of a code point in half (example: <code>unicode(u\"j\u00f6rn\".encode(\"utf-8\")[:2], \"utf-8\")<\/code>). 
Luckily, Python will tell you by throwing a UnicodeDecodeError.<\/p>\n<p>The last attempt actually wasn&#8217;t that wrong, as it only lacked the <code>errors=\"ignore\"<\/code> flag:<\/p>\n<pre><code class=\"python\">unicode(myUnicodeStr.encode(\"utf-8\")[:maxsize], \"utf-8\", errors=\"ignore\")\n<\/code><\/pre>\n<p>One might think we&#8217;re done now, but this depends on your <a href=\"http:\/\/en.wikipedia.org\/wiki\/Unicode_normalization\">Unicode Normalization Form<\/a>: Unicode allows <a href=\"http:\/\/en.wikipedia.org\/wiki\/Combining_character\">combining characters<\/a>, for example the precomposed <code>u\"\u00fc\"<\/code> could be represented by the decomposed sequence <code>u\"u\"<\/code> followed by the combining diaeresis <code>u\"\\u0308\"<\/code> (see <a href=\"http:\/\/en.wikipedia.org\/wiki\/Unicode_normalization\">Unicode Normalization<\/a>).<br \/>\nIn my case I know that my unicode strings are in Unicode Normalization Form C (NFC) (at least the <a href=\"http:\/\/www.w3.org\/TR\/rdf-concepts\/#section-Graph-Literal\">RDF Literal Specs<\/a> say so). This means that if there is a precomposed char it will be used. Nevertheless, Unicode allows combining character sequences which do not have a precomposed canonical equivalent. In this case not even normalizing would help: multiple unicode chars would remain, leading to multiple multi-byte UTF-8 chars. In this case I&#8217;m unsure what&#8217;s the universal solution&#8230; For such a decomposed <code>u\"\u00fc\"<\/code>, is it better to keep the <code>u\"u\"<\/code> or nothing if the split falls between the two? You have to decide.<br \/>\nI decided to keep the &#8220;u&#8221; in the hopefully very rare case this occurs.<br \/>\nSo use the following with care:<\/p>\n<pre><code class=\"python\">def truncateUTF8length(unicodeStr, maxsize):\n    ur\"\"\" This method can be used to truncate the length of a given unicode\n        string such that the corresponding utf-8 string won't exceed\n        maxsize bytes. 
It will take care of multi-byte utf-8 chars intersecting\n        with the maxsize limit: either the whole char fits or it will be\n        truncated completely. Make sure that unicodeStr is in Unicode\n        Normalization Form C (NFC), else strange things can happen as\n        mentioned in the examples below.\n        Returns a unicode string, so if you need it encoded as utf-8, call\n        .encode(\"utf-8\") after calling this method.\n        &gt;&gt;&gt; truncateUTF8lengthIfNecessary(u\"\u00f6\", 2) == (u\"\u00f6\", False)\n        True\n        &gt;&gt;&gt; truncateUTF8length(u\"\u00f6\", 1) == u\"\"\n        True\n        &gt;&gt;&gt; u'\\u1ebf'.encode('utf-8') == '\\xe1\\xba\\xbf'\n        True\n        &gt;&gt;&gt; truncateUTF8length(u'hi\\u1ebf', 2) == u\"hi\"\n        True\n        &gt;&gt;&gt; truncateUTF8lengthIfNecessary(u'hi\\u1ebf', 3) == (u\"hi\", True)\n        True\n        &gt;&gt;&gt; truncateUTF8length(u'hi\\u1ebf', 4) == u\"hi\"\n        True\n        &gt;&gt;&gt; truncateUTF8length(u'hi\\u1ebf', 5) == u\"hi\\u1ebf\"\n        True\n\n        Make sure the unicodeStr is in NFC (see unicodedata.normalize(\"NFC\", ...) ).\n        The following would not be true, as e and u'\\u0301' would be separate\n        unicode chars. 
This could be handled with unicodedata.combining\n        and a loop deleting chars from the end until after the first\n        non-combining char, but this is _not_ done here!\n        #&gt;&gt;&gt; u'e\\u0301'.encode('utf-8') == 'e\\xcc\\x81'\n        #True\n        #&gt;&gt;&gt; truncateUTF8length(u'e\\u0301', 0) == u\"\" # not in NFC (u'\\xe9'), but in NFD\n        #True\n        #&gt;&gt;&gt; truncateUTF8length(u'e\\u0301', 1) == u\"\" #decodes to utf-8: \n        #True\n        #&gt;&gt;&gt; truncateUTF8length(u'e\\u0301', 2) == u\"\"\n        #True\n        #&gt;&gt;&gt; truncateUTF8length(u'e\\u0301', 3) == u\"e\\u0301\"\n        #True\n        \"\"\"\n    return unicode(unicodeStr.encode(\"utf-8\")[:maxsize], \"utf-8\", errors=\"ignore\")\n<\/code><\/pre>\n<p>Unicode and UTF-8 are nice, but if you don&#8217;t pay attention they will cause your code to contain a lot of sleeping bugs. And yes, probably I&#8217;d care less if there was no &#8220;\u00f6&#8221; in my name \ud83d\ude09<\/p>\n<p>PS: G\u00fcnther, this is SFW. :p<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ha, not with me! It&#8217;s a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes. 
The first and in this case worst attempt is [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[2],"tags":[44,132,176,182,183],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pYA5n-4N","jetpack-related-posts":[{"id":314,"url":"https:\/\/joernhees.de\/blog\/2010\/12\/15\/python-unicode-doctest-howto-in-a-doctest\/","url_meta":{"origin":297,"position":0},"title":"Python unicode doctest howto in a doctest","date":"2010-12-15","format":false,"excerpt":"Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest. So I thought the ultimate way to show how to do this would be\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":19,"url":"https:\/\/joernhees.de\/blog\/2010\/06\/28\/python-and-encoding\/","url_meta":{"origin":297,"position":1},"title":"Python and encoding","date":"2010-06-28","format":false,"excerpt":"Well, first real post, so let's start easy. I've been working a lot with python lately, and came across a nice short How to Use UTF-8 with Python which also makes the difference between unicode and utf8 very clear. The howto also links to another valuable source: Characters vs. Bytes,\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":166,"url":"https:\/\/joernhees.de\/blog\/2010\/07\/31\/urlencoding-in-python\/","url_meta":{"origin":297,"position":2},"title":"(URL)Encoding in python","date":"2010-07-31","format":false,"excerpt":"Well, encodings are a never ending story and whenever you don't want to waste time on them, it's for sure that you'll stumble over yet another tripwire. 
This time it is the encoding of URLs (note: even though related I'm not talking about the urlencode function). Perhaps you have seen\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":256,"url":"https:\/\/joernhees.de\/blog\/2010\/09\/21\/how-to-convert-hex-strings-to-binary-ascii-strings-in-python-incl-8bit-space\/","url_meta":{"origin":297,"position":3},"title":"How to convert hex strings to binary ascii strings in python (incl. 8bit space)","date":"2010-09-21","format":false,"excerpt":"As i come across this again and again: How do you turn a hex string like \"c3a4c3b6c3bc\" into a nice binary string like this: \"11000011 10100100 11000011 10110110 11000011 10111100\"? The solution is based on the Python 2.6 new string formatting: >>> \"{0:8b}\".format(int(\"c3\",16)) '11000011' Which can be decomposed into 4\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":360,"url":"https:\/\/joernhees.de\/blog\/2011\/09\/16\/mac-os-x-harddisk-high-load-cycle-counts\/","url_meta":{"origin":297,"position":4},"title":"Mac OS X Harddisk high Load Cycle Counts","date":"2011-09-16","format":false,"excerpt":"Mac OS X's default power management settings might wear your hard drive down unnecessarily. 
This post provides a lot of background information and how to change these settings.","rel":"","context":"In \"193\"","img":{"alt_text":"","src":"https:\/\/i2.wp.com\/joernhees.de\/blog\/wp-content\/uploads\/2011\/09\/RampLoadUnloadDynamics.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":566,"url":"https:\/\/joernhees.de\/blog\/2014\/02\/25\/scientific-python-on-mac-os-x-10-9-with-homebrew\/","url_meta":{"origin":297,"position":5},"title":"Scientific Python on Mac OS X 10.9+ with homebrew","date":"2014-02-25","format":false,"excerpt":"Scientific python setup guide for Mac OS X 10.9 Mavericks with homebrew","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/297"}],"collection":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/comments?post=297"}],"version-history":[{"count":2,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/297\/revisions"}],"predecessor-version":[{"id":810,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/297\/revisions\/810"}],"wp:attachment":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/media?parent=297"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/categories?post=297"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/tags?post=297"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}