{"id":314,"date":"2010-12-15T08:12:34","date_gmt":"2010-12-15T07:12:34","guid":{"rendered":"http:\/\/joernhees.de\/blog\/?p=314"},"modified":"2016-09-29T00:00:11","modified_gmt":"2016-09-28T22:00:11","slug":"python-unicode-doctest-howto-in-a-doctest","status":"publish","type":"post","link":"https:\/\/joernhees.de\/blog\/2010\/12\/15\/python-unicode-doctest-howto-in-a-doctest\/","title":{"rendered":"Python unicode doctest howto in a doctest"},"content":{"rendered":"<p>Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest.<br \/>\nSo I thought the ultimate way to show how to do this would be in a doctest \ud83d\ude09<\/p>\n<pre><code class=\"python\"># -*- coding: utf-8 -*-\n\ndef testDocTestUnicode():\n    ur\"\"\"Non ascii letters in doctests actually are tricky. The reason why\n        things work here that usually don't (each marked with a #BAD!) is\n        explained quite in the end of this doctest, but the essence is: we\n        didn't only fix the encoding of this file, but also the\n        sys.defaultencoding, which you should never do.\n\n        This file has a utf8 input encoding, which python is informed about by\n        the first line: # -*- coding: utf-8 -*-. This means that for example an\n        \u00e4 is 2 bytes: 11000011 10100100 (hexval \"c3a4\").\n\n        There are two types of strings in Python 2.x: \"\" aka byte strings and\n        u\"\" aka unicode string. For these two types two different things happen\n        when parsing a file:\n\n        If python encounters a non ascii char in a byte string (e.g., \"\u00e4\") it\n        will check if there's an input encoding given (yes, utf8) and then check\n        if the 2 bytes \u00e4 is a valid utf-8 encoded char (yes it is). It will then\n        simply keep the \u00e4 as its 2 byte utf-8 encoding in this byte-string\n        internal representation. 
If you print it and you're lucky to have a utf8\n        console you'll see an \u00e4 again. If you're not lucky and for example have\n        an iso-8859-15 encoding on your console you'll see 2 strange chars\n        (probably \u00c3\u20ac) instead. So python will simply write the byte-string to\n        output.\n\n        &gt;&gt;&gt; print \"\u00e4\" #BAD!\n        \u00e4\n\n        If there was no encoding given, we'd get a SyntaxError: Non-ASCII\n        character '\\xc3' in file ..., which is the first byte of our 2 byte \u00e4.\n        Where did the '\\xc3' come from? Well, this is python's way of writing a\n        non ascii byte to ascii output (which is always safe, so perfect for\n        this error message): it will write a \\x and then two hex chars for each\n        byte. Python does the same if we call:\n\n        &gt;&gt;&gt; print repr(\"\u00e4\")\n        '\\xc3\\xa4'\n\n        Or just\n        &gt;&gt;&gt; \"\u00e4\"\n        '\\xc3\\xa4'\n\n        It also works the other way around, so you can give an arbitrary byte by\n        using the same \\xXX escape sequences:\n        &gt;&gt;&gt; print \"\\xc3\\xa4\" #BAD!\n        \u00e4\n\n        Oh look, we hit the utf8 representation of an \u00e4, what luck. You'll ask\n        how do I then print \"\\xc3\\xa4\" to my console? You can either double all\n        backslashes or tell python it's a raw string:\n        &gt;&gt;&gt; print \"\\\\xc3\\\\xa4\"\n        \\xc3\\xa4\n        &gt;&gt;&gt; print r\"\\xc3\\xa4\"\n        \\xc3\\xa4\n\n\n\n        If python encounters a unicode string in our document (e.g., u\"\u00e4\") it\n        will use the specified file encoding to convert our 2 byte utf8 \u00e4 into a\n        unicode string. 
This is the same as calling \"\u00e4\".decode(myFileEncoding):\n        &gt;&gt;&gt; print u\"\u00e4\" # BAD for another reason!\n        \u00e4\n        &gt;&gt;&gt; u\"\u00e4\"\n        u'\\xe4'\n        &gt;&gt;&gt; \"\u00e4\".decode(\"utf-8\")\n        u'\\xe4'\n\n        Python's internal unicode representation of this string is never exposed\n        to the user (it could be UTF-16 or 32 or anything else, anyone?).\n        The hex e4 corresponds to 11100100, the unicode ord value of the char \u00e4,\n        which is decimal 228.\n        &gt;&gt;&gt; ord(u'\u00e4')\n        228\n\n        And the same again backwards, we can use the \\xXX escaping to denote a\n        hex unicode code point or raw not to interpret such escaping:\n        &gt;&gt;&gt; print u\"\\xe4\"\n        \u00e4\n        &gt;&gt;&gt; print ur\"\\xe4\"\n        \\xe4\n\n        Oh, noticed the difference? This time print did some magic. I told\n        you, you'll never see python's internal representation of a unicode\n        string. So whenever print receives a unicode string it will try to\n        convert it to your output encoding (sys.stdout.encoding), which works in a\n        terminal, but won't work if you're for example redirecting output to a\n        file. In such cases you have to convert the string into the desired\n        encoding explicitly:\n        &gt;&gt;&gt; u\"\u00e4\".encode(\"utf8\")\n        '\\xc3\\xa4'\n        &gt;&gt;&gt; print u\"\u00e4\".encode(\"utf8\") #BAD!\n        \u00e4\n\n        If that last line confused you a bit: We converted the unicode string\n        to a byte-string, which was then simply copied byte-wise by print and\n        voila, we got an \u00e4.\n\n\n\n        This all is done before the string even reaches doctest.\n        So you might have written something like all the above in doctests,\n        and probably saw them failing. 
In most cases you probably just \n        forgot the ur'''prefix''', but sometimes you had it and were confused.\n        Well this is good, as all of the above #BAD! examples don't make much sense.\n\n        Bummer, right.\n\n        The reason is: we made assumptions on the default encoding all over the\n        place, which is not a thing you would ever want to do in production\n        code. We did this by setting sys.setdefaultencoding(\"UTF-8\")\n        below. Without this you'll usually get unicode warnings like this one:\n        \"UnicodeWarning: Unicode equal comparison failed to convert both\n        arguments to Unicode - interpreting them as being unequal\".\n        Just fire up a python interpreter (not pydev, as I noticed it seems to\n        fiddle with the default setting).\n        Try: u\"\u00e4\" == \"\u00e4\"\n        You should get:\n            __main__:1: UnicodeWarning: Unicode equal comparison failed to convert both\n                arguments to Unicode - interpreting them as being unequal\n            False\n\n        This actually is very good, as it warns you that you're comparing some\n        byte-string from whatever location (could be a file) to a unicode string.\n        Shall python guess the encoding? Silently? Probably a bad idea.\n\n        Now if you do the following in your python interpreter:\n            import sys\n            reload(sys)\n            sys.setdefaultencoding(\"utf8\")\n            u\"\u00e4\" == \"\u00e4\"\n        You should get:\n            True\n\n        No wonder, you explicitly told python to interpret the \"\u00e4\" as utf8\n        encoded when nothing else specified.\n\n        So what's the problem in our docstrings again? 
We had these bad\n        examples:\n\n        &gt;&gt;&gt; print \"\u00e4\" #BAD!\n        \u00e4\n        &gt;&gt;&gt; print \"\\xc3\\xa4\" #BAD!\n        \u00e4\n        &gt;&gt;&gt; print u\"\u00e4\".encode(\"utf8\") #BAD!\n        \u00e4\n\n        Well, we're in a ur'''docstring''' here, so what doctest does is: it\n        takes the part after &gt;&gt;&gt; and exec(utes) it. There's one special feature\n        of exec I wasn't aware of: if you pass a unicode string to it, it will\n        revert the char back to utf-8:\n\n        &gt;&gt;&gt; exec u'print repr(\"\u00e4\")'\n        '\\xc3\\xa4'\n        &gt;&gt;&gt; exec u'print repr(\"\\xe4\")'\n        '\\xc3\\xa4'\n\n        This means that even though one might think that print \"\u00e4\" in this\n        unicode docstring will become print \"\\xe4\", it will print as if you wrote\n        print \"\u00e4\" outside of a unicode string, so as if you wrote print\n        \"\\xc3\\xa4\". Let this twist your mind for a second. The doctest will\n        execute as if there had been no conversion to a unicode string, which is\n        what you want. But now comes the comparison. It will see what comes out\n        of that and compare to the next line from this docstring, which now is a\n        unicode \"\u00e4\", so \\xe4. Hence we're now comparing u'\\xe4' == '\\xc3\\xa4'.\n        If you didn't notice, this is the same thing we did in the python interpreter\n        above: we were comparing u\"\u00e4\" == \"\u00e4\". And again python tells us: Hmm,\n        don't know, shall I guess how to convert \"\u00e4\" to u\"\u00e4\"? Probably not, so\n        evaluate to False.\n\n\n        Summary:\n        Always specify the source encoding: # -*- coding: utf-8 -*-\n        and _ALWAYS_, no excuse, use utf-8. 
Repeat it: I will never use\n        iso-8859-x, latin-1 or anything else, I'll use UTF-8 so I can write\n        J\u00f6rn and he can actually read his name for once.\n        Use ur'''...''' surrounded docstrings (so a raw unicode docstring).\n        Python 2 rejects ru'''...''' (Russian strings?), so the order has to be ur.\n        Never compare a unicode string with a byte string. This means: don't\n        use u\"\u00e4\" and \"\u00e4\" mixed, they're not the same. Also, the expected-output\n        line can only match unicode strings or plain ascii, no other encoding.\n\n        The following are bad comparisons, as they will compare byte- and\n        unicode strings. They'll cause warnings and evaluate to False:\n        #&gt;&gt;&gt; u\"\u00e4\" == \"\u00e4\"\n        #False\n        #&gt;&gt;&gt; \"\u00e4\".decode(\"utf8\") == \"\u00e4\" \n        #False\n        #&gt;&gt;&gt; print \"\u00e4\"\n        #\u00e4\n\n\n        So finally a few working examples:  \n\n        &gt;&gt;&gt; \"\u00e4\" # if file encoding is utf8\n        '\\xc3\\xa4'\n        &gt;&gt;&gt; u\"\u00e4\"\n        u'\\xe4'\n\n        Here both are unicode, so no problem, but nevertheless a bad idea to\n        match output of print due to the print magic mentioned above and think\n        about i18n: time formats, commas, dots, float precision, etc. \n        &gt;&gt;&gt; print u\"\u00e4\" # unicode even after exec, no prob.\n        \u00e4\n\n        Better:\n        &gt;&gt;&gt; \"\u00e4\" == \"\u00e4\" # compares byte-strings\n        True\n        &gt;&gt;&gt; u\"\u00e4\".encode(\"utf8\") == \"\u00e4\" # compares byte-strings\n        True\n        &gt;&gt;&gt; u\"\u00e4\" == u\"\u00e4\" # compares unicode-strings\n        True\n        &gt;&gt;&gt; \"\u00e4\".decode(\"utf8\") == u\"\u00e4\" # compares unicode-strings\n        True\n    \"\"\"\n    pass\n\n\nif __name__ == \"__main__\":\n    import sys\n    reload(sys)\n    sys.setdefaultencoding(\"UTF-8\") # DON'T DO THIS. 
READ THE ABOVE @UndefinedVariable\n    import doctest\n    doctest.testmod()\n\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Another thing which has been on my stack for quite a while has been a unicode doctest howto, as I remember I was quite lost when I first tried to test encoding stuff in a doctest. So I thought the ultimate way to show how to do this would be in a doctest \ud83d\ude09 # [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[2],"tags":[38,44,69,132,137,144,157,176],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/pYA5n-54","jetpack-related-posts":[{"id":19,"url":"https:\/\/joernhees.de\/blog\/2010\/06\/28\/python-and-encoding\/","url_meta":{"origin":314,"position":0},"title":"Python and encoding","date":"2010-06-28","format":false,"excerpt":"Well, first real post, so let's start easy. I've been working a lot with python lately, and came across a nice short How to Use UTF-8 with Python which also makes the difference between unicode and utf8 very clear. The howto also links to another valuable source: Characters vs. Bytes,\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":166,"url":"https:\/\/joernhees.de\/blog\/2010\/07\/31\/urlencoding-in-python\/","url_meta":{"origin":314,"position":1},"title":"(URL)Encoding in python","date":"2010-07-31","format":false,"excerpt":"Well, encodings are a never ending story and whenever you don't want to waste time on them, it's for sure that you'll stumble over yet another tripwire. This time it is the encoding of URLs (note: even though related I'm not talking about the urlencode function). 
Perhaps you have seen\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":297,"url":"https:\/\/joernhees.de\/blog\/2010\/12\/14\/how-to-restrict-the-length-of-a-unicode-string\/","url_meta":{"origin":314,"position":2},"title":"How to restrict the length of a unicode string","date":"2010-12-14","format":false,"excerpt":"Ha, not with me! It's a pretty common tripwire: Imagine you have a unicode string and for whatever reason (which should be a good reason, so make sure you really need this) you need to make sure that its UTF-8 representation has at most maxsize bytes. The first and in\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":256,"url":"https:\/\/joernhees.de\/blog\/2010\/09\/21\/how-to-convert-hex-strings-to-binary-ascii-strings-in-python-incl-8bit-space\/","url_meta":{"origin":314,"position":3},"title":"How to convert hex strings to binary ascii strings in python (incl. 8bit space)","date":"2010-09-21","format":false,"excerpt":"As i come across this again and again: How do you turn a hex string like \"c3a4c3b6c3bc\" into a nice binary string like this: \"11000011 10100100 11000011 10110110 11000011 10111100\"? The solution is based on the Python 2.6 new string formatting: >>> \"{0:8b}\".format(int(\"c3\",16)) '11000011' Which can be decomposed into 4\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":526,"url":"https:\/\/joernhees.de\/blog\/2013\/06\/08\/mac-os-x-10-8-scientific-python-with-homebrew\/","url_meta":{"origin":314,"position":4},"title":"Scientific Python on Mac OS X 10.8 with homebrew","date":"2013-06-08","format":false,"excerpt":"(newer version of this guide) A step-by-step installation guide to setup a scientific python environment based on Mac OS X and homebrew. Needless to say: Make a backup (Timemachine) First install homebrew. 
Follow their instructions, then come back here. If you don't have a clean install, some of the following\u2026","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":566,"url":"https:\/\/joernhees.de\/blog\/2014\/02\/25\/scientific-python-on-mac-os-x-10-9-with-homebrew\/","url_meta":{"origin":314,"position":5},"title":"Scientific Python on Mac OS X 10.9+ with homebrew","date":"2014-02-25","format":false,"excerpt":"Scientific python setup guide for Mac OS X 10.9 Mavericks with homebrew","rel":"","context":"In &quot;Coding&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/314"}],"collection":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/comments?post=314"}],"version-history":[{"count":2,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/314\/revisions"}],"predecessor-version":[{"id":811,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/posts\/314\/revisions\/811"}],"wp:attachment":[{"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/media?parent=314"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/categories?post=314"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/joernhees.de\/blog\/wp-json\/wp\/v2\/tags?post=314"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}