Pandas read_csv does not raise exception for bad lines when names is specified

Question

pandas read_csv raises an exception (controlled by error_bad_lines) when it encounters lines with too many fields. However, this does not happen when the names argument is specified.

For example, take a csv file with the format:

1, 2, 3
1, 2, 3
1, 2, 3, 4

Reading it with pd.read_csv(filepath, header=None) correctly raises ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4, because of the additional column.

However, when 'names' is specified as an argument:

>>> pd.read_csv(filepath, names=['A', 'B', 'C'], header=None)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3

no error is raised, and the 'too long/bad' line, which should be skipped, is included.

Is there a way to specify names and still have the ParserError be raised such that the too long/bad lines can be dropped with error_bad_lines=False?

user59271 user59271 · Accepted Answer · 2018-06-26T20:05:13

There does not seem to be a tidy pandas solution to this. What you can do is read the CSV file with Python's open() and prepend a header line to its contents in memory, so the original file on disk is not modified. You can then wrap the resulting string in StringIO and load it with pandas. This preserves the error:

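# python3
from io import StringIO
import pandas as pd

# Read the file into memory; the file on disk is not modified.
with open('./test.csv', 'r') as f:
    body = f.read()

# Prepend a header row so the parser expects three fields per line.
file_string = 'A,B,C\n' + body
# skipinitialspace strips the blank after each comma in the data rows.
df = pd.read_csv(StringIO(file_string), sep=',', skipinitialspace=True)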
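
If the goal is only to drop the over-long lines rather than to see the ParserError, another option is to filter them out before pandas reads the data. The following is a minimal sketch using only the standard library; the helper name drop_long_rows and the sample string are made up for illustration and are not part of pandas.

```python
import csv
from io import StringIO


def drop_long_rows(text, n_fields):
    """Return (kept_text, dropped_line_numbers) for CSV text.

    Rows with more than n_fields fields are dropped. Hypothetical helper,
    written for this example only.
    """
    kept = StringIO()
    writer = csv.writer(kept, lineterminator="\n")
    dropped = []
    # skipinitialspace ignores the blank that follows each comma.
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) > n_fields:
            # line_num is the 1-based number of the source line just read
            dropped.append(reader.line_num)
            continue
        writer.writerow(row)
    return kept.getvalue(), dropped


# Same layout as the file in the question: the third line has four fields.
sample = "1, 2, 3\n1, 2, 3\n1, 2, 3, 4\n"
clean, dropped = drop_long_rows(sample, n_fields=3)
print(clean)
print(dropped)
```

The cleaned string can then be loaded with pd.read_csv(StringIO(clean), names=['A', 'B', 'C'], header=None), and the dropped list records which source lines were discarded.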
