{"id":166,"date":"2010-07-31T18:43:02","date_gmt":"2010-07-31T16:43:02","guid":{"rendered":"http:\/\/joernhees.de\/blog\/?p=166"},"modified":"2016-09-28T23:41:40","modified_gmt":"2016-09-28T21:41:40","slug":"urlencoding-in-python","status":"publish","type":"post","link":"https:\/\/joernhees.de\/blog\/2010\/07\/31\/urlencoding-in-python\/","title":{"rendered":"(URL)Encoding in python"},"content":{"rendered":"<p>Well, <a href=\"http:\/\/joernhees.de\/blog\/2010\/06\/28\/python-and-encoding\/\">encodings<\/a> are a never-ending story, and whenever you don&#8217;t want to waste time on them, you can be sure that you&#8217;ll stumble over yet another tripwire. This time it is the encoding of URLs (note: although it is related, I&#8217;m not talking about the <code>urlencode<\/code> function). Perhaps you have seen something like this before:<br \/>\n<code>http:\/\/de.wikipedia.org\/wiki\/Gerhard_Schr%C3%B6der<\/code> which actually is the URI counterpart of this <a href=\"http:\/\/en.wikipedia.org\/wiki\/Internationalized_Resource_Identifier\">IRI<\/a>: <code>http:\/\/de.wikipedia.org\/wiki\/Gerhard_Schr\u00f6der<\/code><\/p>\n<p>Now what&#8217;s the problem, you might ask. 
The problem is that two things can happen here:<br \/>\nEither your browser (or the library you use) thinks: &#8220;hmm, this <code>'\u00f6'<\/code> is strange, let&#8217;s convert it into a <code>'%C3%B6'<\/code>&#8221; or your browser (or lib) doesn&#8217;t care and sends the request with the raw <code>'\u00f6'<\/code> in the URL, which makes the outcome a bit non-deterministic, right?<\/p>\n<p>More details here:<\/p>\n<pre><code>$ curl -I http:\/\/de.wikipedia.org\/wiki\/Gerhard_Schr\u00f6der\nHTTP\/1.0 200 OK\nDate: Thu, 22 Jul 2010 09:41:56 GMT\n...\nLast-Modified: Wed, 21 Jul 2010 11:50:31 GMT\nContent-Length: 144996\n...\nConnection: close\n$ curl -I http:\/\/de.wikipedia.org\/wiki\/Gerhard_Schr%C3%B6der\nHTTP\/1.0 200 OK\nDate: Sat, 31 Jul 2010 00:24:47 GMT\n...\nLast-Modified: Thu, 29 Jul 2010 10:04:31 GMT\nContent-Length: 144962\n...\nConnection: close\n<\/code><\/pre>\n<p>Notice how Date, Last-Modified and Content-Length all differ between the two responses.<\/p>\n<p>OK, so how do we deal with this? 
I&#8217;d say: let&#8217;s always ask for the &#8220;percentified&#8221; version&#8230; but before we do, try to understand this:<\/p>\n<pre><code class=\"python\"># notice that my locale is en.UTF-8\n&gt;&gt;&gt; print \"j\u00f6rn\"\nj\u00f6rn\n&gt;&gt;&gt; \"j\u00f6rn\" # implicitly calls: print repr(\"j\u00f6rn\")\n'j\\xc3\\xb6rn'\n&gt;&gt;&gt; print repr(\"j\u00f6rn\")\n'j\\xc3\\xb6rn'\n&gt;&gt;&gt; u\"j\u00f6rn\"\nu'j\\xf6rn'\n&gt;&gt;&gt; print u\"j\u00f6rn\"\nj\u00f6rn\n&gt;&gt;&gt; print u\"j\u00f6rn\".encode(\"utf8\")\nj\u00f6rn\n&gt;&gt;&gt; u\"j\u00f6rn\".encode(\"utf8\")\n'j\\xc3\\xb6rn'\n&gt;&gt;&gt; \"j\u00f6rn\".encode(\"utf8\")\nTraceback (most recent call last):\n  File \"&lt;stdin&gt;\", line 1, in &lt;module&gt;\nUnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)\n&gt;&gt;&gt; 'j\\xc3\\xb6rn'.decode(\"utf8\")\nu'j\\xf6rn'\n<\/code><\/pre>\n<p>So, what happened here?<br \/>\nAs my locale is set to use UTF-8 encoding, all my inputs are already UTF-8 encoded.<br \/>\nIf you have been wondering why <code>'\u00f6'<\/code> is translated into <code>'%C3%B6'<\/code>, you might have spotted that <code>'\u00f6'<\/code> corresponds to the UTF-8 <code>\"\\xc3\\xb6\"<\/code>, which is Python&#8217;s in-string escape notation for non-ASCII bytes: it refers to 2 bytes with the hex code c3b6 (binary: <code>'11000011 10110110'<\/code>) (quite useful: <code>\"{0:b} {1:b}\".format(int(\"c3\", 16), int(\"b6\",16))<\/code>).<br \/>\nSo in URLs each <code>\"\\xhh\"<\/code> is simply replaced by <code>\"%HH\"<\/code>: a percent sign followed by two uppercase hex digits. 
The Unicode <code>'\u00f6'<\/code> (1 char, code point <code>\"\\xf6\"<\/code> (<code>'11110110'<\/code>)) is hence first transformed into UTF-8 (1 char, 2 bytes, UTF-8: <code>'11000011 10110110'<\/code>) by my OS before it enters Python, is kept internally in this form unless I use <code>u\"\"<\/code> strings, and is then represented in the URL as <code>\"%C3%B6\"<\/code> (6 chars, 6 bytes, ASCII).<br \/>\nWhat this example also shows is the implicit <code>print repr(var)<\/code> performed by the interactive Python interpreter when you simply enter some <code>var<\/code> and hit return.<br \/>\n<code>print<\/code> will try to encode strings using the current locale&#8217;s encoding if they&#8217;re Unicode strings (<code>u\"\"<\/code>). Otherwise Python will not assume that the string has any specific encoding, but will just stick with the encoding your OS chose. It will simply treat the string as it was received and write the byte sequence to your <code>sys.stdout<\/code>.<\/p>\n<p>So back to the manual quoting of URLs:<\/p>\n<pre><code class=\"python\">&gt;&gt;&gt; import urllib as ul\n&gt;&gt;&gt; ul.quote(\"j\u00f6rn\")\n'j%C3%B6rn'\n&gt;&gt;&gt; print ul.quote(\"j\u00f6rn\")\nj%C3%B6rn\n\n&gt;&gt;&gt; ul.unquote('j%C3%B6rn')\n'j\\xc3\\xb6rn'\n&gt;&gt;&gt; ul.unquote(\"j\u00f6rn\")\n'j\\xc3\\xb6rn'\n&gt;&gt;&gt; print ul.unquote(\"j\u00f6rn\")\nj\u00f6rn\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Well, encodings are a never ending story and whenever you don&#8217;t want to waste time on them, it&#8217;s for sure that you&#8217;ll stumble over yet another tripwire. This time it is the encoding of URLs (note: even though related I&#8217;m not talking about the urlencode function). 
Perhaps you have seen something like this before: http:\/\/de.wikipedia.org\/wiki\/Gerhard_Schr%C3%B6der [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[2],"tags":[44,132,136,176,178,179,180,182,183],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pYA5n-2G","jetpack-related-posts":[{"id":19,"url":"https:\/\/joernhees.de\/blog\/2010\/06\/28\/python-and-encoding\/","url_meta":{"origin":166,"position":0},"title":"Python and encoding","date":"2010-06-28","format":false,"excerpt":"Well, first real post, so let's start easy. I've been working a lot with python lately, and came across a nice short How to Use UTF-8 with Python which also makes the difference between unicode and utf8 very clear. The howto also links to another valuable source: Characters vs. Bytes,\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":314,"url":"https:\/\/joernhees.de\/blog\/2010\/12\/15\/python-unicode-doctest-howto-in-a-doctest\/","url_meta":{"origin":166,"position":1},"title":"Python unicode doctest howto in a doctest","date":"2010-12-15","format":false,"excerpt":"Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest. So I thought the ultimate way to show how to do this would be\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":256,"url":"https:\/\/joernhees.de\/blog\/2010\/09\/21\/how-to-convert-hex-strings-to-binary-ascii-strings-in-python-incl-8bit-space\/","url_meta":{"origin":166,"position":2},"title":"How to convert hex strings to binary ascii strings in python (incl. 
8bit space)","date":"2010-09-21","format":false,"excerpt":"As i come across this again and again: How do you turn a hex string like \"c3a4c3b6c3bc\" into a nice binary string like this: \"11000011 10100100 11000011 10110110 11000011 10111100\"? The solution is based on the Python 2.6 new string formatting: >>> \"{0:8b}\".format(int(\"c3\",16)) '11000011' Which can be decomposed into 4\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":297,"url":"https:\/\/joernhees.de\/blog\/2010\/12\/14\/how-to-restrict-the-length-of-a-unicode-string\/","url_meta":{"origin":166,"position":3},"title":"How to restrict the length of a unicode string","date":"2010-12-14","format":false,"excerpt":"Ha, not with me! It's a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes. The first and in\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":91,"url":"https:\/\/joernhees.de\/blog\/2010\/07\/21\/sort-python-dict-by-values\/","url_meta":{"origin":166,"position":4},"title":"Sort python dictionaries by values","date":"2010-07-21","format":false,"excerpt":"Perhaps you already encountered a problem like the following one yourself: You have a large list of items (let's say URIs for this example) and want to sum up how often they were viewed (or edited or... whatever). 
A small one-shot solution in python looks like the following and uses\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":526,"url":"https:\/\/joernhees.de\/blog\/2013\/06\/08\/mac-os-x-10-8-scientific-python-with-homebrew\/","url_meta":{"origin":166,"position":5},"title":"Scientific Python on Mac OS X 10.8 with homebrew","date":"2013-06-08","format":false,"excerpt":"(newer version of this guide) A step-by-step installation guide to setup a scientific python environment based on Mac OS X and homebrew. Needless to say: Make a backup (Timemachine) First install homebrew. Follow their instructions, then come back here. If you don't have a clean install, some of the following\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/166"}],"collection":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/comments?post=166"}],"version-history":[{"count":2,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/166\/revisions"}],"predecessor-version":[{"id":801,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/166\/revisions\/801"}],"wp:attachment":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/media?parent=166"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/categories?post=166"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/tags?post=166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}