Handle wrongly encoded character in Python unicode string

Question

I am dealing with unicode strings returned by the python-lastfm library.

I assume somewhere on the way, the library gets the encoding wrong and returns a unicode string that may contain invalid characters.

For example, the original string i am expecting in the variable a is "Glück"

>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
  File "", line 1, in 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

\xfc is the escaped value 252, which corresponds to the latin1 encoding of "ü". Somehow this gets embedded in the unicode string in a way python can't handle on its own.

How do i convert this back a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError, or a string containing the sequence \xfc.

possible duplicate of BeautifulSoup findall with class attribute- unicode encode error — Andreas Jung

Croad Langshan Croad Langshan · Accepted Answer · 2011-04-22T23:29:52

Your unicode string is fine:

>>> unicodedata.name(u"\xfc")
'LATIN SMALL LETTER U WITH DIAERESIS'

The problem you see at the interactive prompt is that the interpreter doesn't know what encoding to use to output the string to your terminal, so it falls back to the "ascii" codec -- but that codec only knows how to deal with ASCII characters. It works fine on my machine (because sys.stdout.encoding is "UTF-8" for me -- likely because something like my environment variable settings differ from yours)

>>> print u'Gl\xfcck'
Glück

Handle wrongly encoded character in Python unicode string

5 Answers