52
votes

how does the unicode thing works on python2? i just dont get it.

here i download data from a server and parse it for JSON.

Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/hubs/poll.py", line 92, in wait
    readers.get(fileno, noop).cb(fileno)
  File "/usr/local/lib/python2.6/dist-packages/eventlet-0.9.12-py2.6.egg/eventlet/greenthread.py", line 202, in main
    result = function(*args, **kwargs)
  File "android_suggest.py", line 60, in fetch
    suggestions = suggest(chars)
  File "android_suggest.py", line 28, in suggest
    return [i['s'] for i in json.loads(opener.open('https://market.android.com/suggest/SuggRequest?json=1&query='+s+'&hl=de&gl=DE').read())]
  File "/usr/lib/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 217, in JSONArray
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 183, in JSONObject
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib/python2.6/json/decoder.py", line 155, in JSONString
    return scanstring(match.string, match.end(), encoding, strict)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

thank you!!

EDIT: the following string causes the error: '[{"t":"q","s":"abh\xf6ren"}]'. \xf6 should be decoded to ö (abhören)

8
You are trying to interpret some data as UTF-8 that is not valid UTF-8. We don't know your data, so we can't tell you what it actually is. Maybe its a zipped JSON string? In that case, try .decode("zlib") first.Sven Marnach
most of the time it works, just sometimes it does not. i think probally it dont work when there are some special charachtersihucos
0xF6 is ö in ISO-8895-1 and other similar 8-bit encodings. If the original message being ISO-8859-1, rather than UTF-8, is beyond your control, then you can always do message = unicode(message, "ISO-8859-1")Tadeusz A. Kadłubowski
/usr/lib/python2.6/urllib.py:1222: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal and later: KeyError: u'\xf6'ihucos

8 Answers

85
votes

The string you're trying to parse as a JSON is not encoded in UTF-8. Most likely it is encoded in ISO-8859-1. Try the following:

json.loads(unicode(opener.open(...), "ISO-8859-1"))

That will handle any umlauts that might get in the JSON message.

You should read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I hope that it will clarify some issues you're having around Unicode.

6
votes

My solution is a bit funny.I never thought that would it be as easy as save as with UTF-8 codec.I'm using notepad++(v5.6.8).I didn't notice that I saved it with ANSI codec initially. I'm using separate file to place all localized dictionary. I found my solution under 'Encoding' tab from my Notepad++.I select 'Encoding in UTF-8 without BOM' and save it. It works brilliantly.

4
votes

The error you're seeing means the data you receive from the remote end isn't valid JSON. JSON (according to the specifiation) is normally UTF-8, but can also be UTF-16 or UTF-32 (in either big- or little-endian.) The exact error you're seeing means some part of the data was not valid UTF-8 (and also wasn't UTF-16 or UTF-32, as those would produce different errors.)

Perhaps you should examine the actual response you receive from the remote end, instead of blindly passing the data to json.loads(). Right now, you're reading all the data from the response into a string and assuming it's JSON. Instead, check the content type of the response. Make sure the webpage is actually claiming to give you JSON and not, for example, an error message that isn't JSON.

(Also, after checking the response use json.load() by passing it the file-like object returned by opener.open(), instead of reading all data into a string and passing that to json.loads().)

3
votes

The solution to change the encoding to Latin1 / ISO-8859-1 solves an issue I observed with html2text.py as invoked on an output of tex4ht. I use that for an automated word count on LaTeX documents: tex4ht converts them to HTML, and then html2text.py strips them down to pure text for further counting through wc -w. Now, if, for example, a German "Umlaut" comes in through a literature database entry, that process would fail as html2text.py would complain e.g.

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data

Now these errors would then subsequently be particularly hard to track down, and essentially you want to have the Umlaut in your references section. A simple change inside html2text.py from

data = data.decode(encoding)

to

data = data.decode("ISO-8859-1")

solves that issue; if you're calling the script using the HTML file as first parameter, you can also pass the encoding as second parameter and spare the modification.

1
votes

Just in case of someone has the same problem. I'am using vim with YouCompleteMe, failed to start ycmd with this error message, what I did is: export LC_CTYPE="en_US.UTF-8", the problem is gone.

1
votes

Paste this on your command line:

export LC_CTYPE="en_US.UTF-8" 
0
votes

In your android_suggest.py, break up that monstrous one-liner return statement into one_step_at_a_time pieces. Log repr(string_passed_to_json.loads) somewhere so that it can be checked after an exception happens. Eye-ball the results. If the problem is not evident, edit your question to show the repr.

0
votes

Temporary workaround: unicode(urllib2.urlopen(url).read(), 'utf8') - this should work if what is returned is UTF-8.

urlopen().read() return bytes and you have to decode them to unicode strings. Also it would be helpful to check the patch from http://bugs.python.org/issue4733