26
votes

I am trying to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)

According to the SEC the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, \n- terminated lines, with the first line containing the field names in lowercase.

My current code:

import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)

All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me? Many thanks in advance.

5
Can we see the file you are using? - dangee1705
Also, is this Python 2 or 3? The answer is very important, since the csv module is broken for non-ASCII on Python 2. - ShadowRanger
I am using Python 3.6.0 - Vital
Hmm... On rereading the error, I'm pretty sure the problem is your input file. The error indicates it is trying to read it as utf-8, so your input likely doesn't follow the format described. That said, the file you linked seems to follow it just fine (it's pure ASCII AFAICT; it uses some unusual ASCII control characters, but they're all in the ASCII range), so I'm not sure where you'd see a \xa0 byte. Is it possible you modified the file by accident before using it? - ShadowRanger
See Kopytok's answer below. If I change the encoding to 'windows-1252', it works perfectly. - Vital

5 Answers

39
votes

The file's encoding is 'windows-1252'. Use:

open('txt.tsv', encoding='windows-1252')
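
A minimal sketch of how that fits into the csv loop from the question, assuming the file really is windows-1252. The helper name read_tsv_rows and the small sample file it reads are made up for illustration (this is not the SEC data). Passing newline='' follows the csv module's documented recommendation for opening files.

```python
import csv
import os
import tempfile


def read_tsv_rows(path, encoding="windows-1252"):
    """Return all rows of a tab-delimited file as a list of dicts.

    `encoding` defaults to windows-1252, which is an assumption about
    the input; pass a different codec if your file uses another one.
    """
    with open(path, encoding=encoding, newline="") as tsvfile:
        reader = csv.DictReader(tsvfile, dialect="excel-tab")
        return list(reader)


# Write a tiny sample file containing the problematic 0xa0 byte
# (hypothetical data standing in for txt.tsv).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(sample_path, "wb") as f:
    f.write(b"adsh\tvalue\n0001\tfoo\xa0bar\n")

rows = read_tsv_rows(sample_path)
for row in rows:
    print(row)
```

For a large file you would probably iterate over the reader instead of building a list; the list is used here only to keep the sketch short.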
3
votes

If you are working with Turkish data, I suggest this line:

df = pd.read_csv("text.txt",encoding='windows-1254')
2
votes
ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252') 

Works fine for me, thanks.

1
votes

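I had the same error message for a .csv file, and this worked for me (note that Python only recognizes 'ANSI' as an alias of the 'mbcs' codec, which exists on Windows only):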

     df = pd.read_csv('Text.csv',encoding='ANSI')
0
votes

If the input has a stray '\xa0', then it's not in UTF-8, full stop.

You have to either recode it to UTF-8 (with the iconv or recode commands; many text editors and IDEs can do it too), or read it using an 8-bit encoding (as all the other answers suggest).
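
If you would rather recode from Python than from the command line, something along these lines should work. It is only a sketch: the function name recode_to_utf8 and the file names are hypothetical, and the source encoding (windows-1252 here) is an assumption you have to verify for your own file.

```python
import os
import tempfile


def recode_to_utf8(src_path, dst_path, src_encoding="windows-1252"):
    """Rewrite src_path as UTF-8, decoding it with src_encoding.

    newline="" disables newline translation, so line endings are
    copied through unchanged.
    """
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical sample file containing a single 0xa0 byte.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "legacy.tsv")
dst_path = os.path.join(tmpdir, "utf8.tsv")
with open(src_path, "wb") as f:
    f.write(b"name\tnote\nx\tfoo\xa0bar\n")

recode_to_utf8(src_path, dst_path)

with open(dst_path, "rb") as f:
    recoded = f.read()
print(recoded)
```

After recoding, the plain open(dst_path, encoding='utf-8') call no longer raises.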

What you should ask yourself is - what is this character after all (0xa0 or 160)? Well, in many 8-bit encodings it's a non-breaking space (like &nbsp; in HTML). In at least one DOS encoding it's an accented "a" character. That's why you need to look at the result of decoding it from the 8-bit encoding.
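
A quick way to look at that result is to decode the single byte with a few candidate codecs and print what comes out. The three codecs below are just examples picked for illustration (cp437 is the DOS code page I have in mind); they are not a claim about what your file actually uses.

```python
import unicodedata

raw = b"\xa0"

# Decode the same byte under several 8-bit encodings and compare.
decoded = {}
for codec in ("windows-1252", "latin-1", "cp437"):
    char = raw.decode(codec)
    decoded[codec] = char
    print(codec, repr(char), unicodedata.name(char))
```

Whichever result makes sense in the surrounding text is a good hint at the real encoding.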

BTW, sometimes people say "UTF-8" when they mean "mostly ASCII, I guess". And if it was a non-breaking space, they weren't that far off:

In [1]: '\xa0'.encode()
Out[1]: b'\xc2\xa0'

One extra preceding '\xc2' byte would do the trick.