I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.
This is related to a problem I had the other week: "python check if utf-8 string is uppercase".
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8')  # cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
words = (u'\u041c\u041e\u0421\u041a\u0412\u0410',  # capital of Russia, all uppercase
         u'R\xc9SUM\xc9',  # RESUME with accents
         u'R\xe9sum\xe9',  # Resume with accents
         u'R\xe9SUM\xe9',)  # ReSUMe with accents
for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper():  # .isupper won't function on utf8
        title = etree.SubElement(sect, 'title')
        title.text = word
    else:
        item = etree.SubElement(sect, 'item')
        item.text = word
print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
It fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
But if I open the new file without codecs.open('test.xml', 'w', 'utf-8')
and instead use
outFile = open('test.xml', 'w')
it works perfectly.
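The error message can be reproduced without lxml or a file at all. The following is only a minimal sketch, assuming the failure comes from an already-encoded UTF-8 byte string being decoded with the default ASCII codec before it is encoded again; the traceback suggests this, but nothing here confirms it. The helper name implicit_ascii_decode is hypothetical and exists only for illustration.

```python
from __future__ import print_function


def implicit_ascii_decode(data):
    """Decode a byte string as ASCII and report the error text on failure.

    Hypothetical helper: it makes explicit the decode step that is assumed
    to happen implicitly when bytes are given to an encoding writer.
    """
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)


# UTF-8 bytes for the first word in the question (already encoded, not text).
already_encoded = u'\u041c\u041e\u0421\u041a\u0412\u0410'.encode('utf-8')

# Non-ASCII bytes cannot be decoded as ASCII, so the error text is returned.
print(implicit_ascii_decode(already_encoded))

# Pure-ASCII bytes pass through the same step unchanged.
print(implicit_ascii_decode(b'<root/>'))
```

Under that assumption, a plain open() would avoid the problem simply because it writes the bytes as they are, with no decode step in between.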
So what's happening?
Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
If I leave codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But the output of etree.tostring() is simply being written to stdout and is then redirected to a file that was created as a UTF-8 file?
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.
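One way to check the guess about the default encoding is to compare the output of tostring() with and without the encoding argument. This sketch uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption made so it runs without lxml installed); lxml offers a similar tostring(), but its exact output may differ.

```python
from __future__ import print_function

import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

root = ET.Element('root')
root.text = u'R\xe9sum\xe9'

# No encoding argument: the serializer falls back to its default.
default_bytes = ET.tostring(root)

# Explicit UTF-8: the accented characters are written as UTF-8 byte pairs.
utf8_bytes = ET.tostring(root, encoding='utf-8')

# Check whether each result contains only ASCII bytes.
default_is_ascii = all(b < 128 for b in bytearray(default_bytes))
utf8_is_ascii = all(b < 128 for b in bytearray(utf8_bytes))

print(repr(default_bytes), default_is_ascii)
print(repr(utf8_bytes), utf8_is_ascii)
```

In both cases the result is a byte string that has already been encoded, which matters when it is then passed to a file object that expects to do the encoding itself.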
open() to open the file, and leaving the encoding='utf-8' parameter when you call tostring(). Also, word.encode('utf8').decode('utf8')!? – Thomas K
open(), just curious as to why. 2. word.encode('utf8').decode('utf8') must be interpreted wrong. I assure you I need to use word.decode('utf8') in my project, which has its words fed to it from a different file. See this for more on .isupper() with UTF-8: stackoverflow.com/questions/6391442/… – matchew