0
votes

I am reading a .csv file containing traces from a network protocol, with hex characters and normal text mixed together. I have tried several encodings: utf-8, cp1252, latin1...

For latin1:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 51: ordinal not in range(128)

For utf-8:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 51: invalid start byte

For cp1252:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 51: ordinal not in range(128)

The code used is:

df=pd.read_csv(file,sep='`',error_bad_lines=False,encoding='cp1252',names=colnames,quotechar='"')

I am no big expert in encodings, but I would like to know how to solve it:

How can I find out the current encoding of the csv file I am reading?

Is there a very permissive codec that accepts pretty much everything?

Thanks.

3
What's inside your csv ? - Maxouille
For utf-8 you get an expected UnicodeDecodeError in read_csv because it contains a non utf-8 character. For Latin1 and cp1252 you get a UnicodeEncodeError (note Encode instead of Decode) probably in a different instruction. I need the full stacktraces and the relevant code to be able to help you. - Serge Ballesta
Only the writer knows the character encoding. Ask, read documentation, applicable standards or watch for ways in which it is communicated (e.g., HTTP response header Content-Type.) Or—if the source supports it—use a format that doesn't suffer from this problem, such as .ods or .xlsx. - Tom Blodget
with cp1252 you shouldn't get error 'ascii' codec can't encode character u'\xb0'. You can get it only with ascii. Using b'\xb0'.decode('cp1252') I see it can be ° (degree sign) - furas
What @furas says is correct about both latin1 and cp1252. Latin1 is your answer (amongst the dozens of character encodings that use all 256 bytes values in any order). It seems a bad test led to the question. - Tom Blodget

3 Answers

1
votes

CSV is a text format; it's not really suitable for storing arbitrary blobs of binary data.

To solve your immediate problem, you can specify 'latin-1' as the encoding. This codec has the useful property that every byte value 0-255 maps to the Unicode code point with the same number, so decoding can never fail and re-encoding recovers the original bytes exactly.
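As a quick sanity check of that property (a minimal sketch, not part of the original answer):

```python
# Every possible byte value decodes under latin-1, and the decode/encode
# round trip is lossless: byte 0xNN maps to code point U+00NN and back.
data = bytes(range(256))           # all 256 byte values
text = data.decode('latin-1')      # never raises
assert text.encode('latin-1') == data
print(text[0xb0])                  # the 0xb0 from the error message is the degree sign
```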

Beware, though, that this is somewhat likely to produce various kinds of mojibake if you are not careful. You should probably pull out any binary data, then decode the remaining fields to proper Unicode as soon as possible. Here's a simple pure Python snippet for UTF-8 with one field containing binary.

import csv

with open(filename, encoding='latin-1') as input:
    reader = csv.reader(input)
    for row in reader:
        # field 42 holds raw binary; recover the original bytes
        binary = row[42].encode('latin-1')
        newrow = [field.encode('latin-1').decode('utf-8') for field in row]
        newrow[42] = binary
        # newrow is now decoded UTF-8 except field 42, which is a bytes object
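Applied to the pandas call from the question, a sketch looks like this (using an in-memory file and made-up column names, since the real path and colnames aren't shown):

```python
import io

import pandas as pd

# A row containing the problem byte 0xb0 (the degree sign in latin-1/cp1252).
raw = b'temp`90\xb0`ok\n'

# latin-1 can decode any byte sequence, so read_csv will not raise here.
df = pd.read_csv(io.BytesIO(raw), sep='`', encoding='latin-1',
                 names=['name', 'value', 'status'])
print(df.loc[0, 'value'])  # 90 followed by the degree sign
```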

-1
votes

First find out the file's encoding, then use it for reading.

To find encoding type:

Method 1: Open the file in Notepad and go to File -> Save As. Next to the Save button there is an encoding drop-down, and the file's current encoding will be selected there.

Method 2: On Linux systems, you can use the file command. It will report the encoding:

    > file sub01.csv

sub01.csv: ASCII text
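Note that file can only guess. A stdlib-only alternative (a hypothetical helper, just a sketch) is to try a few candidate codecs in order and keep the first one that decodes cleanly:

```python
def guess_encoding(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate codec that decodes `data` without error."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# 0xb0 is not valid UTF-8, but decodes fine under cp1252 (as the degree sign).
print(guess_encoding(b'90\xb0 outside'))  # cp1252
```

Trying utf-8 first matters: latin-1 accepts everything, so it must be the last resort.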

-1
votes

For reading a csv file:

import csv
def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    # Python 2: csv.reader yields byte strings, so decode each cell to unicode
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3
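The snippet above is Python 2 (unicode, print statement). On Python 3 the csv module works with text natively, so no per-cell decoding is needed; a rough equivalent (sketched with an in-memory file standing in for the csv) is:

```python
import csv
import io

# Stand-in for open('da.csv', encoding='utf-8', newline=''); on Python 3,
# csv.reader yields lists of str directly -- no manual decode step needed.
f = io.StringIO('a,b,c\n1,2,3\n')
rows = list(csv.reader(f))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```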