I have a dictionary, data, in which I have stored:

  • key - ID of an event

  • value - the name of this event, where value is a UTF-8 string

Now, I want to write this map to a JSON file. I tried this:

with open('events_map.json', 'w') as out_file:
    json.dump(data, out_file, indent = 4)

but this gives me the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte

Now, I also tried with:

with io.open('events_map.json', 'w', encoding='utf-8') as out_file:
   out_file.write(unicode(json.dumps(data, encoding="utf-8")))

but this raises the same error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte

I also tried with:

with io.open('events_map.json', 'w', encoding='utf-8') as out_file:
    out_file.write(unicode(json.dumps(data, encoding="utf-8", ensure_ascii=False)))

but this raises the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position 3114: ordinal not in range(128)

Any suggestions on how I can solve this problem?

EDIT: I believe this is the line that is causing me the problem:

> data['142']
'\xbf/ANCT25'

EDIT 2: The data variable is read from a file. So, after reading it from a file:

data_file_lines = io.open(file_name, 'r', encoding='utf8').readlines()

I then do:

with io.open('data/events_map.json', 'w', encoding='utf8') as json_file:
        json.dump(data, json_file, ensure_ascii=False)

Which gives me the error:

TypeError: must be unicode, not str

Then, I try to do this with the data dictionary:

for tuple in sorted_tuples:  # the `data` variable is initialized from these tuples
    data[str(tuple[1])] = json.dumps(tuple[0], ensure_ascii=False, encoding='utf8')

which is, again, followed by:

with io.open('data/events_map.json', 'w', encoding='utf8') as json_file:
    json.dump(data, json_file, ensure_ascii=False)

but again, the same error:

TypeError: must be unicode, not str

I get the same error when I use the simple open function for reading from the file:

data_file_lines = open(file_name, "r").readlines()

The string in your data dictionary is not actually UTF-8 encoded; decoding it to Unicode fails. - Martijn Pieters♦
Can you please put the actual data dictionary in your post? Just include the output of print data. - Martijn Pieters♦
The data variable is too big to paste. Anyway, I think only one entry in my dictionary is causing the problem. I edited my post. - Belphegor
That string is indeed not UTF-8 encoded. Is that supposed to be an inverted question mark, perhaps? - Martijn Pieters♦
You'll have to either replace that value with an actual UTF-8 encoded value, or replace it with a Unicode value (so explicitly decode it first before passing it to json.dump()). - Martijn Pieters♦

1 Answer

The exception is caused by the contents of your data dictionary: at least one of the keys or values is not UTF-8 encoded.
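
If you are not sure which entries are affected, a minimal sketch along these lines can list them. The helper name find_non_utf8 and the sample dictionary are made up for illustration; only the '142' entry comes from the question. It uses byte literals so that it behaves the same on Python 2 and Python 3:

```python
def find_non_utf8(mapping):
    """Return the (key, value) pairs where the key or the value is a
    byte string that cannot be decoded as UTF-8."""
    def is_bad(item):
        # Only byte strings need checking; text (unicode) is already decoded.
        if not isinstance(item, bytes):
            return False
        try:
            item.decode("utf-8")
        except UnicodeDecodeError:
            return True
        return False

    return [(k, v) for k, v in mapping.items() if is_bad(k) or is_bad(v)]


# Hypothetical sample: the second entry mirrors data['142'] from the question.
sample = {b"141": b"ANCT24", b"142": b"\xbf/ANCT25"}
bad_entries = find_non_utf8(sample)
print(bad_entries)
```

Running the same check on your real data dictionary should tell you whether '142' is the only entry that needs fixing.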

You'll have to replace this value, either by substituting a value that is UTF-8 encoded, or by decoding just that value to a unicode object with whatever encoding is actually correct for it:

data['142'] = data['142'].decode('latin-1')

to decode that string as a Latin-1-encoded value instead.
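
If several entries might be affected, the same idea can be applied to the whole dictionary before serializing. This is only a sketch: the helper to_text is hypothetical, the second sample entry is invented, and Latin-1 as the fallback is an assumption you would need to confirm for your data (Latin-1 never fails to decode, so a wrong guess silently produces wrong characters):

```python
import json


def to_text(value, fallback="latin-1"):
    """Return value as text: try UTF-8 first, then the fallback encoding."""
    if not isinstance(value, bytes):
        return value  # already text (or not a string at all)
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode(fallback)


# Hypothetical sample: the first entry mirrors data['142'] from the question,
# the second is a correctly UTF-8 encoded byte string.
data = {b"142": b"\xbf/ANCT25", b"7": b"Caf\xc3\xa9"}

# Decode every key and value so the dictionary holds only text.
clean = {to_text(k): to_text(v) for k, v in data.items()}

# With only text in the dictionary, serializing no longer has to guess an encoding.
text = json.dumps(clean, ensure_ascii=False, indent=4)
print(text)
```

The resulting text can then be written to a file opened with io.open('events_map.json', 'w', encoding='utf-8').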