1
votes

I keep getting UnicodeDecodeErrors like the one below when I try to read files that contain a mix of English and Chinese characters and were supposedly encoded as UTF-8.

'utf-8' codec can't decode byte 0xed in position 88875: invalid continuation byte

I was told the files are UTF-8, and Notepad++ also reports them as UTF-8 when I open them. But when I try to read a file into a pandas DataFrame in Python, I get the error. The only unusual thing I am doing is working with a bytes object, because I fetch the files from SharePoint using SharePlum.

import io
import pandas as pd
bytes_case = folder.get_file('filename.csv')  # folder is a SharePlum folder object
case = pd.read_csv(io.BytesIO(bytes_case), encoding="utf-8")

Output of print(bytes_case[88875-100:88875+100]):

b'\xb7\xe7\xa0\x81\xe8\xae\xbe\xe7\xbd\xae\xe4\xba\x86\xe6\x8b\xa6\xe6\x88\xaa\xef\xbc\x8c\xe8\xaf\xb7\xe5\x85\x88\xe5\x8f\x96\xe6\xb6\x88\xef\xbc\x8c\xe4\xbb\xa5\xe4\xbe\xbf\xe4\xb8\x93\xe5\x91\x98\xe8\x83\xbd\xe9\xa1\xba\xe5\x88\xa9\xe4\xb8\x8e\xe6\x82\xa8\xe8\x81\x94\xe7\xb3\xbb\xe8\xb7\x9f\xe8\xbf\x9b\xe5\x93\x92\xe3\x80\x82\r\n03-10 15:12:34\r\n\xed\xa0\xbd\xed\xb1\x8c\r\n03-10 15:12:49\xe6\x88\x91\r\n\xe8\xaf\xb7\xe9\x97\xae\xe8\xbf\x98\xe6\x9c\x89\xe5\x85\xb6\xe4\xbb\x96\xe5\x8f\xaf\xe4\xbb\xa5\xe5\xb8\xae\xe6\x82\xa8\xe5\x90\x97\xef\xbc\x9f\r\n03-10 15:13:27\r\n\xe6\xb2\xa1\xe4\xba\x86\r\n03-10 15:13'

1
Hard to say without the file. Can you post the output of print(bytes_case[88875-100:88875+100])? - Justin Ezequiel
I've added the output above - Sasha
Tried a few encodings and could not find one that worked for your data. Perhaps they used mixed encodings when they wrote the file? - Justin Ezequiel

1 Answer

0
votes

I "fixed" it by decoding the bytes as UTF-8 with errors ignored and then re-encoding the result as UTF-8 before passing it to pandas.

bytes_case = bytes_case.decode('utf-8', errors="ignore").encode('utf-8')

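Not ideal: errors="ignore" silently drops every byte that fails to decode, so whatever characters those bytes represented are lost.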
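
A possible alternative that keeps the data, offered as a sketch rather than a verified fix for the whole file: in the byte dump from the question, the sequence \xed\xa0\xbd\xed\xb1\x8c decodes to the surrogate pair U+D83D U+DC4C, which together stand for U+1F44C (an emoji). Strict UTF-8 rejects surrogates encoded this way, which would explain the error at byte 0xed. Assuming every failure in the file is of this kind, Python's built-in "surrogatepass" error handler can accept the surrogates, and a round trip through UTF-16 joins each pair into one code point. The helper name repair_surrogate_utf8 is made up for illustration.

```python
def repair_surrogate_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes that contain surrogate pairs written as 3-byte sequences."""
    # 'surrogatepass' makes the UTF-8 codec accept encoded surrogates
    # (U+D800..U+DFFF) instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Encoding to UTF-16 writes each surrogate as one code unit; decoding
    # again joins every high/low pair into the code point it represents.
    # An unpaired surrogate would still raise UnicodeDecodeError here.
    return text.encode("utf-16", errors="surrogatepass").decode("utf-16")


# Excerpt of the bytes posted in the question.
sample = b"03-10 15:12:34\r\n\xed\xa0\xbd\xed\xb1\x8c\r\n"
fixed = repair_surrogate_utf8(sample)
```

The returned string could then be handed to pandas through io.StringIO(fixed) instead of io.BytesIO. If the file also contains bytes that are invalid for some other reason, this will still raise, which at least makes the problem visible instead of discarding data.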