Why does printing to a utf-8 file fail?

Question

I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.

This is related to a problem I had the other week: python check if utf-8 string is uppercase

Basically, the following will not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

But if I open the file with outFile = open('test.xml', 'w') instead of codecs.open('test.xml', 'w', 'utf-8'), it works perfectly.

So what's happening?

Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But the output of etree.tostring() is simply being written to stdout and then redirected to a file that was created as a UTF-8 file?
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.

It looks like tostring() produces a string, and writing to the file opened with codecs.open() expects unicode. Try using plain open() to open the file, and leaving the encoding='utf-8' parameter when you call tostring(). Also, word.encode('utf8').decode('utf8')!? - Thomas K
@Thomas K, 1. I am okay with using open(), just curious as to why. 2. word.encode('utf8').decode('utf8') must have been interpreted wrong. I assure you I need to use word.decode('utf8') in my project, which has its words fed to it from a different file. See this for more on .isupper() with utf8 => stackoverflow.com/questions/6391442/… - matchew
1: Your call to tostring() produces an encoded string (not unicode). The file opened with codecs.open() expects to receive unicode, so when you give it a bytestring, it chokes (in fact, it tries to decode as ascii and re-encode as utf-8). 2: I understand that you may have needed to decode, but you're encoding and then immediately decoding with the same codec, which gives you back what you started with. I highly recommend reading this: joelonsoftware.com/articles/Unicode.html - Thomas K

MRAB · Accepted Answer · 2011-06-29T22:21:08

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring() is encoding to UTF-8, so it returns a bytestring (str), and that bytestring is what you're writing to the file.

Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.

Unfortunately, the bytestrings aren't ASCII.
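
To make that implicit step visible, here is a minimal sketch that performs the ASCII decode explicitly, so it behaves the same on Python 2 and Python 3. The sample bytes and the variable names are illustrative assumptions, not the actual output of the script in the question.

```python
# Bytes similar to what tostring(..., encoding='utf-8') produces for the
# first word in the question (illustrative sample, not the real output).
xml_bytes = u'<title>\u041c\u041e\u0421\u041a\u0412\u0410</title>'.encode('utf-8')

# In Python 2 the codecs writer effectively attempts this decode before it
# can encode to UTF-8. Here the same step is spelled out by hand.
try:
    xml_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first non-ASCII byte is the lead byte of U+041C in UTF-8, the same
# byte value that the traceback in the question complains about.
failing_byte = bytearray(xml_bytes)[error.start]
print(hex(failing_byte))

# Decoding with the codec that actually produced the bytes succeeds.
text = xml_bytes.decode('utf-8')
```

The point of the sketch is that the failure comes from the decode, not from the encode to UTF-8, which is why the exception is a UnicodeDecodeError even though the code is writing.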

EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
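
As a rough sketch of that advice, the two consistent ways to write the result are shown below: either hand the encoded bytes to a plain binary file, or decode them and let a text-mode file do the encoding. The stdlib xml.etree.ElementTree is used here as a stand-in for lxml.etree (an assumption made so the example has no third-party dependency), and the file names are hypothetical.

```python
import io
import os
import shutil
import tempfile
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

root = ET.Element('root')
title = ET.SubElement(root, 'title')
title.text = u'\u041c\u041e\u0421\u041a\u0412\u0410'  # Unicode internally

# tostring(..., encoding='utf-8') returns already-encoded bytes.
xml_bytes = ET.tostring(root, encoding='utf-8')

out_dir = tempfile.mkdtemp()
bytes_path = os.path.join(out_dir, 'bytes.xml')  # hypothetical file names
text_path = os.path.join(out_dir, 'text.xml')

# Option 1: encoded bytes go to a binary file, with no codec wrapper.
with open(bytes_path, 'wb') as f:
    f.write(xml_bytes)

# Option 2: decode to Unicode first, then let the file object encode.
# newline='' turns off newline translation so the bytes match exactly.
with io.open(text_path, 'w', encoding='utf-8', newline='') as f:
    f.write(xml_bytes.decode('utf-8'))

# Both routes should leave identical UTF-8 bytes on disk.
with open(bytes_path, 'rb') as f1, open(text_path, 'rb') as f2:
    same_bytes = f1.read() == f2.read()

shutil.rmtree(out_dir)
```

What breaks is mixing the two: passing already-encoded bytes to a file object that expects Unicode, as the codecs.open() version in the question does.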
