
I am trying to extract the value of the Body attribute from the row element in pi.xml.

    cat pi.xml
    <?xml version="1.0" encoding="utf-8"?>
    <posts>
         <row Id="19" Body=" The value of π, the value of pi." />
    </posts>

The Python file, pi.py:

    from lxml import etree
    doc = etree.parse('pi.xml')
    r = doc.findall('row')
    for i in r:
        print (i.get('Body'))

And the locale:

    $ locale
    LANG=en_IN
    LANGUAGE=en_IN:en
    LC_CTYPE="en_IN"
    LC_NUMERIC="en_IN"
    LC_TIME="en_IN"
    LC_COLLATE="en_IN"    
    LC_ALL=

Running the script as python pi.py works fine.
But if I redirect the output and run it as python pi.py >> pi.txt, I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in position 101: ordinal not in range(128)

If I change print (i.get('Body')) to print (i.get('Body').encode('utf-8')), then python pi.py >> pi.txt works fine. But is this the proper way to do it?

Operating system: Ubuntu.

Try: $ PYTHONIOENCODING=utf8 python pi.py >> py.txt - Mark Tolonen
It worked, thanks Mark! But I haven't found a solution that works everywhere. When I used $ PYTHONIOENCODING=utf8 python somefile.py >> somefile.txt with other files, it didn't work (the same UnicodeEncodeError is thrown). I'll keep looking for a solution, and if I find one I'll post it here. - abT
If the script explicitly encodes its output, this method won't work. Scripts should just print Unicode and let the terminal decide the encoding. - Mark Tolonen
@Mark Tolonen: So, is it correct to use x.decode('utf-8') after reading x from a UTF-8 encoded file, and then print processed_x.encode('utf-8') to save the output to another file? This always works for me and never gives an error. I'd appreciate your suggestion. - abT
print processed_x.encode('utf-8') works if the console is configured for UTF-8, but it wouldn't work on a console configured for iso-8859-1. A plain print processed_x will automatically encode as UTF-8 if the console is configured for UTF-8. Redirection is a shell function, so leave specifying the encoding to the shell as well, with PYTHONIOENCODING=utf8 python pi.py >> py.txt. That also leaves the option open to use other encodings without modifying the script. - Mark Tolonen

1 Answer


Use:

    PYTHONIOENCODING=utf8 python pi.py >> py.txt

But if your script explicitly encodes its output, such as:

    print u'somestring'.encode('utf8')

this method won't help, because the script has already turned the text into bytes before Python's output encoding is applied. Scripts should instead just print Unicode and let the terminal decide the encoding, as in:

    print u'somestring'

Python will automatically encode the output as UTF-8 if the console is configured for UTF-8.
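To see which encoding Python has picked for standard output, something like the following sketch may help. The helper names describe_stream and can_encode are made up for illustration, and the fallback to 'ascii' when no encoding is reported reflects the Python 2 behaviour described in this answer, not something the script detects by itself. The report goes to stderr so it does not end up in a redirected file.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys


def describe_stream(stream):
    """Return (encoding, is_tty) for a file-like object (hypothetical helper)."""
    encoding = getattr(stream, 'encoding', None)
    isatty = getattr(stream, 'isatty', None)
    return encoding, bool(isatty and isatty())


def can_encode(text, encoding):
    """Return True if `text` encodes with `encoding` without an error."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == '__main__':
    encoding, is_tty = describe_stream(sys.stdout)
    # Write the report to stderr so it stays visible when stdout is redirected.
    print('stdout encoding: %r, tty: %r' % (encoding, is_tty), file=sys.stderr)
    # Assumption: with no reported encoding, Python 2 falls back to ASCII.
    print('pi encodable: %r' % can_encode(u'\u03c0', encoding or 'ascii'),
          file=sys.stderr)
```

Running it once directly and once with >> out.txt should show whether the reported encoding changes between the two cases.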

For your redirection case, Python doesn't know what encoding to use when printing Unicode, so it defaults to ASCII. Since redirection is a shell function, leave specifying the encoding to the shell using:

    PYTHONIOENCODING=utf8 python pi.py >> py.txt

This leaves the option open to use other encodings without modifying the script.
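If the output file is really a fixed part of the script rather than something the caller chooses, a different technique from the one recommended above is to skip shell redirection and have the script open the file with an explicit encoding. This is only a sketch: append_text is a hypothetical helper, and UTF-8 is assumed as the default target encoding.

```python
# -*- coding: utf-8 -*-
import io


def append_text(path, text, encoding='utf-8'):
    """Append one line of Unicode text to `path` (hypothetical helper)."""
    # io.open accepts an encoding argument on both Python 2 and Python 3,
    # so the Unicode text is encoded by the file object, not by print.
    with io.open(path, 'a', encoding=encoding) as handle:
        handle.write(text + u'\n')
```

For example, append_text('pi.txt', i.get('Body')) inside the loop would replace the print call. The trade-off is that the encoding and destination are now fixed in the script, which is exactly what the PYTHONIOENCODING approach avoids.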