
Python 2.x's builtin json library supports encoding both unicode strings and UTF-8 encoded (non-ASCII) byte strings - but apparently not both at the same time. Try:

import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False)

and see it raise a UnicodeDecodeError, whereas both:

json.dumps([u'Ä'], ensure_ascii=False)

and

json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)

...work ok.

Why does JSON encoding of data containing both unicode and UTF-8 encoded (non-ASCII) strings raise a UnicodeDecodeError? My Python site encoding is ASCII.
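Put together in one script, the behaviour looks like this (a minimal reproduction sketch, assuming Python 2.7 with the default ASCII site encoding):

# -*- coding: utf-8 -*-
# Minimal reproduction sketch (Python 2.7, default ASCII site encoding assumed).
import json

# Each input kind works on its own:
print repr(json.dumps([u'Ä'], ensure_ascii=False))                  # unicode result
print repr(json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False))  # byte-string result

# Mixing unicode and UTF-8 byte strings fails:
try:
    json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False)
except UnicodeDecodeError as err:
    print "mixed input failed:", err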

Comments:

BTW, at least ujson handles this kind of case just fine. – Petri
json.dumps(['Ä'.encode("utf-8")], ensure_ascii=False) does not work ok. – luoluo
Which Python version? I think you should use u'Ä'.encode("utf-8") (note the u!) – RemcoGerlich
@Petri: you should fix it on the other line too. – RemcoGerlich
"UTF-8 encoded" is not a string. It's just a sequence of bytes. – Matthias

1 Answer


It doesn't work because json.dumps doesn't know what kind of output string to produce.

In my Python 2.7:

>>> json.dumps([u'Ä'], ensure_ascii=False)
u'["\xc4"]'

(a Unicode string)

and

>>> json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)
'["\xc3\x84"]'

(a UTF-8 encoded byte string)

So if you give it UTF-8 encoded byte strings, it produces UTF-8 encoded byte-string JSON, and if you give it Unicode strings, it produces Unicode JSON.

If you mix them, it can't do both: when the output fragments are joined, combining a unicode fragment with a non-ASCII byte-string fragment forces an implicit decode with the default ASCII codec, and that is what raises the UnicodeDecodeError.
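A rough illustration of that coercion (not json's actual internal code path, just the same kind of join, assuming Python 2 with the default ASCII codec):

# -*- coding: utf-8 -*-
# Rough illustration of the coercion that fails (not json's internal code path).
unicode_fragment = u'["\xc4", "'       # piece coming from the unicode input
byte_fragment = u'Ä'.encode("utf-8")   # piece coming from the byte-string input

try:
    unicode_fragment + byte_fragment   # implicit ASCII decode of the byte string
except UnicodeDecodeError as err:
    print "same error as the mixed json.dumps call:", err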

To fix this, you can pass an explicit encoding argument (even though the default is already correct); it seems the result is then always a unicode string:

>>> import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False, encoding="UTF8")
u'["\xc4", "\xc4"]'