4
votes

pandas read_csv raises an exception (see error_bad_lines) when it encounters lines with too many fields. However, this does not happen when the names argument is specified.

Example: a CSV file with the format:

1, 2, 3
1, 2, 3
1, 2, 3, 4

Reading it with pd.read_csv(filepath, header=None) correctly raises, because of the additional column, ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
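For reference, this can be reproduced without a file on disk. A minimal sketch, assuming the three lines above are held in an in-memory string (the `data` and `raised` names are only for illustration):

```python
from io import StringIO

import pandas as pd

# The three example lines; the last one has an extra field.
data = "1, 2, 3\n1, 2, 3\n1, 2, 3, 4\n"

try:
    pd.read_csv(StringIO(data), header=None)
    raised = False
except pd.errors.ParserError as exc:
    # The field count is inferred from the first line, so line 3 is rejected.
    raised = True
    print(exc)
```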

However, when 'names' is specified as an argument:

>>> pd.read_csv(filepath, names=['A', 'B', 'C'], header=None)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3

no error is raised, and the too-long (bad) line, which should be skipped, is included with its extra field silently dropped.

Is there a way to specify names and still have the ParserError be raised such that the too long/bad lines can be dropped with error_bad_lines=False?

2 Answers

1
votes

There does not seem to be a tidy pandas solution to this. What you could do is read the CSV file with Python's open() and prepend a header line to the resulting string; this does not modify the original file on disk. You can then pass that string to pandas through StringIO. This preserves the error:

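# Python 3
from io import StringIO
import pandas as pd

# Read the whole file into memory; the file on disk is left untouched.
with open('./test.csv', 'r') as f:
    content = f.read()

# Prepend a header row (no spaces, so the column names are exactly A, B, C).
file_string = 'A,B,C\n' + content
df = pd.read_csv(StringIO(file_string), sep=',')  # the 4-field line raises ParserError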
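
Once the error is back, the too-long lines can be skipped instead of raising. A hedged sketch of that step: `read_skipping_bad_lines` is a hypothetical helper name, the sample string stands in for the file contents, and the fallback is there because error_bad_lines was deprecated in pandas 1.3 in favour of on_bad_lines:

```python
from io import StringIO

import pandas as pd


def read_skipping_bad_lines(text, header="A,B,C"):
    """Hypothetical helper: prepend a header row, then drop too-long lines."""
    buf = StringIO(header + "\n" + text)
    try:
        # pandas >= 1.3
        return pd.read_csv(buf, skipinitialspace=True, on_bad_lines="skip")
    except TypeError:
        # Older pandas does not know on_bad_lines; use the old flag instead.
        buf.seek(0)
        return pd.read_csv(buf, skipinitialspace=True, error_bad_lines=False)


# Stand-in for the contents of the CSV file from the question.
sample = "1, 2, 3\n1, 2, 3\n1, 2, 3, 4\n"
df = read_skipping_bad_lines(sample)
print(df)
```

skipinitialspace=True is only there because the example file has a space after each comma.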
0
votes

An educated guess based on your input example: by passing names you may be implicitly telling pd.read_csv to read only len(names) columns, as if len(usecols) == len(names). As a result, the column that caused your initial exception is not imported.

You will get your initial exception back when you pass as many header names in names as there are columns in the CSV file:

# 1. Determine the maximum column count
sep = ','                      # separator used in the file
with open(filepath) as f:      # read all lines, then close the file
    lines = f.readlines()
colcount = max(len(line.strip().split(sep)) for line in lines)  # most fields on any line

# 2. Add column headers
df = pd.read_csv(filepath, names=range(colcount))
# you can rename your columns of interest here in case of error_bad_lines=False

Now the column with missing values will be included and your exception will come back. Note that this way of counting the maximum number of columns only works for plain delimited files such as .csv, and only when the separator never appears inside a quoted field.
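
In case the read succeeds instead, with the shorter rows padded with NaN (which is what I would expect when names covers every column), the too-long rows can still be dropped by hand. A rough sketch under that assumption: `read_dropping_long_rows` is a made-up helper name, the sample string stands in for the file, and csv.reader is swapped in for the plain split so that quoted separators are counted correctly:

```python
import csv
from io import StringIO

import pandas as pd


def read_dropping_long_rows(text, names, sep=","):
    """Hypothetical helper: keep only rows with at most len(names) fields."""
    # Count fields with the csv module so quoted separators are not split.
    reader = csv.reader(StringIO(text), delimiter=sep, skipinitialspace=True)
    colcount = max(len(names), max(len(row) for row in reader))

    # Read every column under a temporary integer name.
    df = pd.read_csv(StringIO(text), sep=sep, header=None,
                     names=list(range(colcount)), skipinitialspace=True)

    # Rows with a value in any extra column are the too-long ones.
    # Caveat: an extra field that is empty would not be detected this way.
    extra = df.columns[len(names):]
    too_long = df[extra].notna().any(axis=1)

    good = df.loc[~too_long, df.columns[:len(names)]].copy()
    good.columns = list(names)
    return good


sample = "1, 2, 3\n1, 2, 3\n1, 2, 3, 4\n"
good = read_dropping_long_rows(sample, ["A", "B", "C"])
print(good)
```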