2
votes

I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.

This is related to a problem I had the other week: python check if utf-8 string is uppercase

Basically, the following will not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

But if I open the new file with outFile = open('test.xml', 'w') instead of codecs.open('test.xml', 'w', 'utf-8'), it works perfectly.

So what's happening?

  • Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?

  • If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?

  • But the output of etree.tostring() is simply being written to stdout and then redirected to a file that was created as a UTF-8 file?

    • Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.

3
It looks like tostring() produces a string, and writing to the file opened with codecs.open() expects unicode. Try using plain open() to open the file, and leaving the encoding='utf-8' parameter when you call tostring(). Also, word.encode('utf8').decode('utf8')!? - Thomas K
@Thomas K, 1. I am okay with using open(), just curious as to why. 2. word.encode('utf8').decode('utf8') must have been interpreted wrong. I assure you I need to use word.decode('utf8') in my project, which has its words fed to it from a different file. See this for more on .isupper() with utf8 => stackoverflow.com/questions/6391442/… - matchew
1: Your call to tostring() produces an encoded string (not unicode). The file opened with codecs.open() expects to receive unicode, so when you give it a bytestring, it chokes (in fact, it tries to decode as ASCII and re-encode as UTF-8). 2: I understand that you may have needed to decode, but you're encoding and then immediately decoding with the same codec, which gives you back what you started with. I highly recommend reading this: joelonsoftware.com/articles/Unicode.html - Thomas K
@matchew: You might want to take a look at that answer. - tzot

3 Answers

3
votes

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring() is encoding to UTF-8 and returning a bytestring (str), which you're then writing to the file.

Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.

Unfortunately, the bytestrings aren't ASCII.
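To make the failing step concrete, here is a minimal sketch of what the codecs writer is effectively doing. It is an illustration, not the actual code in codecs.py: the explicit decode('ascii') stands in for the implicit conversion Python 2 performs when a bytestring is handed to a writer that wants Unicode, and the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch only: mimics the implicit bytestring -> Unicode step with an
# explicit decode('ascii'). Variable names are hypothetical.

text = u'\u041c\u041e\u0421\u041a\u0412\u0410'  # the first word from the question
encoded = text.encode('utf-8')                   # bytes, like tostring(..., encoding='utf-8')

implicit_decode_failed = False
failure_reason = None
try:
    # The writer needs Unicode, so the bytes are first decoded as ASCII.
    encoded.decode('ascii')
except UnicodeDecodeError as exc:
    implicit_decode_failed = True
    failure_reason = exc.reason

# Decoding with the codec that actually produced the bytes works fine.
round_tripped = encoded.decode('utf-8')

print(implicit_decode_failed)
print(failure_reason)
```

The first byte of the encoded Cyrillic text is 0xd0, the same byte value the traceback in the question complains about.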

EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
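As a rough sketch of that advice, assuming the words arrive from a UTF-8 text file with one word per line (the helper names read_words and write_words are hypothetical, and io.open is used here because it behaves the same on Python 2.7 and Python 3):

```python
# -*- coding: utf-8 -*-
# Sketch only: decode at the input boundary, work with Unicode internally,
# encode at the output boundary. Helper names are hypothetical.
import io
import os
import tempfile


def read_words(path):
    # Decoding happens here, once, on input.
    with io.open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip(u'\n') for line in f]


def write_words(path, words):
    # Encoding happens here, once, on output.
    with io.open(path, 'w', encoding='utf-8') as f:
        for word in words:
            f.write(word + u'\n')


words_in = [u'\u041c\u041e\u0421\u041a\u0412\u0410', u'R\xc9SUM\xc9', u'R\xe9sum\xe9']
demo_path = os.path.join(tempfile.mkdtemp(), 'words.txt')
write_words(demo_path, words_in)

words_out = read_words(demo_path)
# Inside the program everything is Unicode, so isupper() needs no
# encode/decode dance.
flags = [word.isupper() for word in words_out]
```

With this layout, nothing between the two helpers ever has to call encode() or decode().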

1
votes

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.

Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('test.xml',xml_declaration=True,encoding='utf-8')
0
votes

In addition to MRAB's answer, some lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
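The general rule seems to be: bytes go to a plain (binary) file, Unicode goes to an encoding-aware file. Here is a hedged sketch of both pairings using the built-in xml.etree instead of lxml, so there is no pretty_print; the file names and variable names are made up for the example, and io.open stands in for codecs.open so the snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: the two consistent pairings of data type and file type.
# Uses the standard library's ElementTree; names are hypothetical.
import io
import os
import tempfile
from xml.etree import ElementTree as ET

root = ET.Element('root')
title = ET.SubElement(root, 'title')
title.text = u'\u041c\u041e\u0421\u041a\u0412\u0410'

# With encoding='utf-8', tostring() returns an encoded bytestring.
xml_bytes = ET.tostring(root, encoding='utf-8')

out_dir = tempfile.mkdtemp()
bytes_path = os.path.join(out_dir, 'as_bytes.xml')
text_path = os.path.join(out_dir, 'as_text.xml')

# Pairing 1: already-encoded bytes into a file that does no encoding.
with open(bytes_path, 'wb') as f:
    f.write(xml_bytes)

# Pairing 2: Unicode text into a file that encodes on write.
with io.open(text_path, 'w', encoding='utf-8') as f:
    f.write(xml_bytes.decode('utf-8'))

# Read both back as raw bytes to compare what landed on disk.
with open(bytes_path, 'rb') as f:
    from_bytes_file = f.read()
with open(text_path, 'rb') as f:
    from_text_file = f.read()
```

Mixing the two (encoded bytes into an encoding-aware file) is what triggers the error in the question.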