6
votes

I have a dummy Python module with the utf-8 header that looks like this:

# -*- coding: utf-8 -*-
a = "á"
print type(a), a

Which prints:

<type 'str'> á

But I thought that all string literals inside a Python module declared as utf-8 would automatically be of type unicode, instead of str. Am I missing something, or is this the correct behaviour?

In order to get a as a unicode string I use:

a = u"á"

But this doesn't seem very "polite", nor practical. Is there a better option?

3
Use Python 3 instead and all strings will be Unicode. - Mark Ransom
@MarkRansom I can't change the Python version because of compatibility issues. - Caumons
What is not 'polite' about using a u'..' unicode literal? Why do you feel it is impractical? - Martijn Pieters
Because it's really easy to forget to use it, and you have to add the u char before ALL strings. The desired behaviour is the one with Python 3. - Caumons
from __future__ import unicode_literals will turn all literals into Unicode literals (Python 2.6 and 2.7). - Sven Marnach

3 Answers

6
votes
# -*- coding: utf-8 -*-

doesn't make the string literals Unicode. Take this example: I have a UTF-8 encoded file with an Arabic comment and an Arabic string:

# هذا تعليق عربي
print type('نص عربي')

If I run it, it throws a SyntaxError exception:

SyntaxError: Non-ASCII character '\xd9' in file file.py
on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

So to allow this, I have to add this line to tell the interpreter that the file is UTF-8 encoded:

# -*-coding: utf-8 -*-

# هذا تعليق عربي
print type('نص عربي')

Now it runs fine, but it still prints <type 'str'> unless I make the string Unicode:

# -*-coding: utf-8 -*-

# هذا تعليق عربي
print type(u'نص عربي')
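
To see why the type matters in practice, here is a minimal sketch (written so it should run on both Python 2.7 and Python 3; the names text and raw are mine, purely for illustration). It compares the Unicode string with the UTF-8 bytes that a plain '...' literal would hold in Python 2:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: 'text' and 'raw' are hypothetical names.
text = u"نص عربي"            # Unicode literal: a sequence of code points
raw = text.encode("utf-8")   # the UTF-8 bytes a plain '...' literal holds in Python 2

# The byte string is longer, because each Arabic letter takes two bytes in UTF-8.
print(len(text))
print(len(raw))

# Decoding the bytes with the right codec gives the Unicode string back.
print(raw.decode("utf-8") == text)
```

Indexing or slicing the byte string works on bytes, not characters, so it can cut a letter in half; the Unicode string does not have that problem.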
5
votes

No, the codec at the top only informs Python how to interpret the source code, and uses that codec to interpret Unicode literals. It does not turn literal bytestrings into unicode values. As PEP 263 states:

This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in an Unicode aware editor.

Emphasis mine.

Without the codec declaration, Python has no idea how to interpret non-ASCII characters:

$ cat /tmp/test.py 
example = '☃'
$ python2.7 /tmp/test.py 
  File "/tmp/test.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

If Python behaved the way you expect it to, you would not be able to write literal bytestring values that contain non-ASCII byte values either.

If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only by virtue of luck that the encodings match.
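
As a rough illustration of that "luck" (a sketch, not something from PEP 263; the variable names are made up), the same three bytes decode to very different text depending on which codec the consumer assumes:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the variable names are hypothetical.
snowman_bytes = b"\xe2\x98\x83"  # UTF-8 encoding of U+2603 SNOWMAN

as_utf8 = snowman_bytes.decode("utf-8")      # the codec the bytes were written in
as_latin1 = snowman_bytes.decode("latin-1")  # a mismatched codec, one char per byte

print(len(as_utf8))    # 1
print(len(as_latin1))  # 3
print(as_utf8 == as_latin1)  # False
```

A terminal that assumes Latin-1 would show three unrelated characters for the same byte string, which is why relying on the terminal's encoding is fragile.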

The correct way to get unicode values is to use unicode literals, or to otherwise produce unicode (decoding from byte strings, converting integer codepoints to unicode characters, etc.):

unicode_snowman = '\xe2\x98\x83'.decode('utf8')
unicode_snowman = unichr(0x2603)
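
A small sketch to confirm that these routes produce the same value. The try/except shim is my addition so the snippet also runs on Python 3, where unichr no longer exists and chr returns text; the from_* names are illustrative only:

```python
# -*- coding: utf-8 -*-
try:
    unichr            # exists on Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a text character

from_literal = u"\u2603"                      # unicode literal with an escape
from_bytes = b"\xe2\x98\x83".decode("utf8")   # decoding a UTF-8 byte string
from_codepoint = unichr(0x2603)               # integer code point to character

print(from_literal == from_bytes == from_codepoint)  # True
print(len(from_literal))                             # 1
```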

In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.

2
votes

No, this is just the source code encoding. Please see http://www.python.org/dev/peps/pep-0263/

To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:

      # coding=<encoding name>

or (using formats recognized by popular editors)

      #!/usr/bin/python
      # -*- coding: <encoding name> -*-

or

      #!/usr/bin/python
      # vim: set fileencoding=<encoding name> :

This doesn't make all literals unicode; it only specifies how unicode literals should be decoded.
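
For the curious, the declaration formats quoted above can be checked against the regular expression that PEP 263 gives for the magic comment. The helper below is a simplified sketch of my own (declared_encoding is a hypothetical name, not a Python API); it only looks at the first two lines and ignores the interpreter's finer rules, such as BOM handling:

```python
import re

# Regular expression for the declaration line, as given in PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding declared on line 1 or 2, or None (hypothetical helper)."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


print(declared_encoding(["# coding=utf-8"]))
print(declared_encoding(["#!/usr/bin/python", "# -*- coding: latin-1 -*-"]))
print(declared_encoding(["#!/usr/bin/python", "# vim: set fileencoding=utf-8 :"]))
```

Note that nothing in this pattern says anything about the type of string literals; it only names a codec for reading the file.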

One should use the unicode function or the u prefix to make a literal unicode.

N.B. In Python 3, all strings are unicode.
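
If you are stuck on Python 2.6 or 2.7, the from __future__ import unicode_literals suggestion from the comments on the question gets you the Python 3 behaviour for one module at a time. A minimal sketch, assuming the import is the first statement in the file (the names a and b are just for illustration):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

a = "☃"              # no u prefix, yet this is unicode on Python 2 as well
b = b"\xe2\x98\x83"  # byte strings now need an explicit b prefix

# Prints "unicode" then "str" on Python 2, and "str" then "bytes" on Python 3.
print(type(a).__name__)
print(type(b).__name__)

print(len(a))  # 1
print(len(b))  # 3
```

The import only affects the module it appears in, so literals in other modules keep their old type until those files are updated too.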