0
votes

I am reading a .csv file containing traces from a network protocol, with hex characters and normal text mixed together. I have tried several encodings: utf-8, cp1252, latin1...

For latin1:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 51: ordinal not in range(128)

For utf-8:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 51: invalid start byte

For cp1252:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 51: ordinal not in range(128)

The code used is:

df=pd.read_csv(file,sep='`',error_bad_lines=False,encoding='cp1252',names=colnames,quotechar='"')

I am no big expert in encodings, but I would like to know how to solve it:

How can I find out the current encoding of the csv file I am reading?

Is there a very permissive codec that accepts pretty much everything?

Thanks.

3
What's inside your csv ? - Maxouille
For utf-8 you get an expected UnicodeDecodeError in read_csv because it contains a non utf-8 character. For Latin1 and cp1252 you get a UnicodeEncodeError (note Encode instead of Decode) probably in a different instruction. I need the full stacktraces and the relevant code to be able to help you. - Serge Ballesta
Only the writer knows the character encoding. Ask, read documentation, applicable standards or watch for ways in which it is communicated (e.g., HTTP response header Content-Type.) Or—if the source supports it—use a format that doesn't suffer from this problem, such as .ods or .xlsx. - Tom Blodget
with cp1252 you shouldn't get error 'ascii' codec can't encode character u'\xb0'. You can get it only with ascii. Using b'\xb0'.decode('cp1252') I see it can be ° (degree sign) - furas
What @furas says is correct about both latin1 and cp1252. Latin1 is your answer (amongst the dozens of character encodings that use all 256 bytes values in any order). It seems a bad test led to the question. - Tom Blodget

3 Answers

1
votes

CSV is a text format; it's not really suitable for storing arbitrary blobs of binary data.

To solve your immediate problem, you can specify 'latin-1' as the encoding. This codec has the useful property that every byte value 0-255 maps to the Unicode code point with the same number, so decoding can never fail and re-encoding recovers the original bytes exactly.
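As a quick sanity check of that property (a minimal sketch, not part of the original answer):

```python
# Every possible byte value decodes under latin-1, and the decode/encode
# round trip is lossless: byte 0xNN maps to code point U+00NN and back.
data = bytes(range(256))           # all 256 byte values
text = data.decode('latin-1')      # never raises
assert text.encode('latin-1') == data
print(text[0xb0])                  # the 0xb0 from the error message is the degree sign
```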

Beware, though, that this is somewhat likely to produce various kinds of mojibake if you are not careful. You should probably pull out any binary data, then decode the remaining fields to proper Unicode as soon as possible. Here's a simple pure Python snippet for UTF-8 with one field containing binary.

import csv

with open(filename, encoding='latin-1') as input:
    reader = csv.reader(input)
    for row in reader:
        # field 42 holds raw binary; recover the original bytes
        binary = row[42].encode('latin-1')
        newrow = [field.encode('latin-1').decode('utf-8') for field in row]
        newrow[42] = binary
        # newrow is now decoded UTF-8 except field 42, which is a bytes object
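Applied to the pandas call from the question, a sketch looks like this (using an in-memory file and made-up column names, since the real path and colnames aren't shown):

```python
import io

import pandas as pd

# A row containing the problem byte 0xb0 (the degree sign in latin-1/cp1252).
raw = b'temp`90\xb0`ok\n'

# latin-1 can decode any byte sequence, so read_csv will not raise here.
df = pd.read_csv(io.BytesIO(raw), sep='`', encoding='latin-1',
                 names=['name', 'value', 'status'])
print(df.loc[0, 'value'])  # 90 followed by the degree sign
```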

-1
votes

First find out the file's encoding, then use it for reading.

To find encoding type:

Method 1: Open the file in Notepad and go to File -> Save As. Next to the Save button there is an encoding drop-down, and the file's current encoding will be selected there.

Method 2: On Linux systems, you can use the file command. It will report the encoding:

    > file sub01.csv

sub01.csv: ASCII text
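Note that file can only guess. A stdlib-only alternative (a hypothetical helper, just a sketch) is to try a few candidate codecs in order and keep the first one that decodes cleanly:

```python
def guess_encoding(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate codec that decodes `data` without error."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# 0xb0 is not valid UTF-8, but decodes fine under cp1252 (as the degree sign).
print(guess_encoding(b'90\xb0 outside'))  # cp1252
```

Trying utf-8 first matters: latin-1 accepts everything, so it must be the last resort.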

-1
votes

For reading a csv file:

import csv
def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    # Python 2: csv.reader yields byte strings, so decode each cell to unicode
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3
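The snippet above is Python 2 (unicode, print statement). On Python 3 the csv module works with text natively, so no per-cell decoding is needed; a rough equivalent (sketched with an in-memory file standing in for the csv) is:

```python
import csv
import io

# Stand-in for open('da.csv', encoding='utf-8', newline=''); on Python 3,
# csv.reader yields lists of str directly -- no manual decode step needed.
f = io.StringIO('a,b,c\n1,2,3\n')
rows = list(csv.reader(f))
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```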