python - 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

Question

According to the SEC, the data set is provided in a single encoding, as follows:

Tab Delimited Value (.txt): utf-8, tab-delimited, \n-terminated lines, with the first line containing the field names in lowercase.

My current code:

import csv

with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)

All attempts ended with the following error message:

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I am a bit lost. Can anyone help me? Many thanks in advance.

Also, is this Python 2 or 3? The answer is very important, since the csv module is broken for non-ASCII on Python 2. — ShadowRanger
Hmm... On rereading the error, I'm pretty sure the problem is your input file. The error indicates it is trying to read it as utf-8, so your input likely doesn't follow the format described. That said, the file you linked seems to follow it just fine (it's pure ASCII AFAICT; it uses some unusual ASCII control characters, but they're all in the ASCII range), so I'm not sure where you'd see a \xa0 byte. Is it possible you modified the file by accident before using it? — ShadowRanger
See koPytok's answer below. If I change the encoding to 'windows-1252', it works perfectly. — Vital

koPytok · Accepted Answer · 2018-01-02T21:00:07

39 votes

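The file is encoded in 'windows-1252', not UTF-8. The byte 0xa0 is a non-breaking space in windows-1252, but in UTF-8 it can only appear as a continuation byte, which is why the decoder reports "invalid start byte". Use: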

open('txt.tsv', encoding='windows-1252')
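
For completeness, here is a minimal sketch of the reader from the question with the encoding passed explicitly. The helper name read_tsv and the sample data are made up for illustration and are not part of the SEC data set; the sample only contains a 0xa0 byte so that the UTF-8 failure can be reproduced.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="windows-1252"):
    """Return the rows of a tab-delimited file as a list of dicts.

    read_tsv is a hypothetical helper name used only for this sketch.
    """
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as tsvfile:
        return list(csv.DictReader(tsvfile, dialect="excel-tab"))


# Hypothetical sample file that contains the byte 0xa0.
sample = b"name\tvalue\nfoo\xa0bar\t1\n"
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    # Decoding as UTF-8 fails in the same way as in the question.
    try:
        read_tsv(sample_path, encoding="utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # windows-1252 decodes 0xa0 as U+00A0 (non-breaking space).
    rows = read_tsv(sample_path)
finally:
    os.remove(sample_path)

print(utf8_failed, rows)
```

If the real file turns out to use a different single-byte encoding, only the encoding argument needs to change.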