
I have read the Unicode HOWTO from the official docs and a full, very detailed article as well. I still don't understand why I get this error.

Here is what I attempt: I open an XML file that contains characters outside the ASCII range (but inside the allowed XML range). I do that with cfg = codecs.open(filename, encoding='utf-8', mode='r'), which runs fine. Looking at the string with repr() also shows me a unicode string.

Now I go ahead and parse that with parseString(cfg.read().encode('utf-8')). Of course, my XML file starts with this: <?xml version="1.0" encoding="utf-8"?>. Although I suppose it is not relevant, I also declared utf-8 as the encoding of my Python script, but since I am not writing unicode characters directly in it, this should not apply here. The same goes for the line from __future__ import unicode_literals, which is also right at the beginning.

Next, I pass the generated object to my own class, where I read tags like this: xmldata.getElementsByTagName(tagName)[0].firstChild.data and assign the result to a variable in my class.
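For reference, the parsing step described above can be sketched as follows. This is only a minimal illustration: the `config` and `name` tags and the sample value are hypothetical, and the document is built in memory instead of being read from a file. It shows that `parseString` accepts UTF-8 encoded bytes and that `firstChild.data` comes back as a unicode string.

```python
# -*- coding: utf-8 -*-
from xml.dom.minidom import parseString

# Hypothetical sample document; the real file and tag names are not shown in the question.
xml_text = (u'<?xml version="1.0" encoding="utf-8"?>'
            u'<config><name>M\xfcller</name></config>')

# parseString expects encoded bytes that match the XML declaration.
xml_bytes = xml_text.encode("utf-8")

doc = parseString(xml_bytes)

# The text node is decoded again by the parser, so this is a unicode string.
name = doc.getElementsByTagName("name")[0].firstChild.data
```

If the file is only opened in order to be parsed, reading it in binary mode and passing the bytes straight to `parseString` would avoid the decode and re-encode round trip.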

What works perfectly are these commands (obj is an instance of the class):

for element in obj:
    print element

And this command does work as well:

print obj.__repr__()

I defined __iter__() to just yield every variable, while __repr__() uses the typical printf-style formatting: "%s" % self.varname

Both commands print perfectly and can output the unicode character. What does not work is this:

print obj

And now I am stuck because this throws the dreaded

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 47:

So what am I missing? What am I doing wrong? I am looking for a general solution: I always want to handle strings as unicode, to avoid any possible errors and write a compatible program.

Edit: I also defined this:

def __str__(self):
    return self.__repr__()
def __unicode__(self):
    return self.__repr__()

From documentation I got that this

print obj will use the object's __str__, not __repr__. - BrenBarn
What is your default encoding? I mean sys.getdefaultencoding(). - Maksym Polshcha
@BrenBarn: __str__ is implemented as return __repr__(). - javex
@MaksymPolshcha: it is ascii according to the function. - javex
I would really recommend taking a look at this talk from PyCon 2012: Pragmatic Unicode, or, How do I stop the pain? youtube.com/watch?v=sgHbC6udIqc - root

1 Answer


I finally solved it. The problem was (I am not sure why) that calling either __str__() or __repr__() directly and printing the result worked fine, but printing the object directly (as in: print obj) did not, although it should just call __str__() itself.

The final help came from this article. I had already reached the point where it printed to the console (but with a wrong letter) when I used utf-8 encoding. I finally got it fully correct by defining this:

def __str__(self):
    # Assumes "from sys import stdout" at the top of the module.
    return self.__repr__().encode(stdout.encoding)
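A slightly more defensive variant of the same idea might look like the sketch below. The `Entry` class is hypothetical, and the `utf-8` fallback is an assumption of mine, not part of the original fix: on Python 2, `sys.stdout.encoding` can be `None` when the output is piped or redirected, in which case `encode(stdout.encoding)` would fail. The version check keeps the class usable on Python 3, where `__str__` must return text instead of bytes.

```python
# -*- coding: utf-8 -*-
import sys


class Entry(object):
    """Hypothetical class holding one unicode value read from the XML."""

    def __init__(self, varname):
        self.varname = varname

    def __unicode__(self):
        # Always build the text form as unicode.
        return u"%s" % self.varname

    def __str__(self):
        text = self.__unicode__()
        if sys.version_info[0] < 3:
            # Python 2: __str__ should return bytes, so encode explicitly.
            # Fall back to utf-8 (an assumption) when stdout has no encoding.
            encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
            return text.encode(encoding)
        # Python 3: __str__ returns text and print handles the encoding.
        return text


entry = Entry(u"M\xfcller")
as_text = entry.__unicode__()
as_str = str(entry)
```

Keeping the unicode text in `__unicode__` and encoding only in `__str__` means the rest of the program can keep working with unicode strings, and bytes are produced only at the output boundary.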

Now the only open question that remains is: why do print obj.__str__() and print obj behave differently here? It makes no sense to me. And yes, to stress that again: calling the former or __repr__() DID work, and still does with the explicit encoding.
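One likely explanation, assuming Python 2 semantics: print obj goes through str(obj), and when __str__ returns a unicode object, str() converts it to bytes with the default encoding (ascii here, as reported by sys.getdefaultencoding()), which fails on u'\xfc'. With print obj.__str__(), print receives the unicode object itself and encodes it with the console encoding instead. The sketch below only simulates the two paths; `implicit_str` and `explicit_str` are hypothetical helper names, not Python internals.

```python
# -*- coding: utf-8 -*-
text = u"M\xfcller"


def implicit_str(value):
    """Stand-in for what Python 2's str() does with a unicode result:
    encode it with the ASCII default encoding."""
    return value.encode("ascii")


def explicit_str(value, encoding="utf-8"):
    """Encode with an explicitly chosen encoding, as in the fixed __str__."""
    return value.encode(encoding)


try:
    implicit_str(text)
    implicit_failed = False
except UnicodeEncodeError:
    # Same error class as in the question: 'ascii' codec can't encode u'\xfc'.
    implicit_failed = True

encoded = explicit_str(text)
```

This would also explain why returning already encoded bytes from __str__ resolves the error: str() then has nothing left to convert.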