
pandas refuses to read files in which a line after the first has too many commas.

Trying to read_csv the following:

col1,col2,col3
foo,1,2
bar,2,3
zob,0,3,4

gives me an error.

However, pandas accepts the following, no matter which read_csv options I tried:

col1,col2,col3
foo,1,2
bar,2,3
zob,0

and simply treats the value in col3 for the last line as null.

Is there a pandas way to raise an exception when this (too few fields in a row) happens? (In my case, it means the source of the file is faulty and the file needs to be downloaded again.)

It seems error_bad_lines only concerns lines with too many commas.

I could count the commas on each line separately before calling read_csv, but I'd like to know whether pandas has an option for this, since that would be more natural and keep the code readable.
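For reference, the manual pre-check mentioned above could look roughly like the sketch below. It uses the standard csv module instead of counting raw commas, so that quoted fields containing commas are not miscounted. The function name check_field_counts is hypothetical and not part of pandas.

```python
import csv


def check_field_counts(lines, delimiter=","):
    """Raise ValueError if any row's field count differs from the header's.

    `lines` is any iterable of text lines, e.g. an open file object.
    Hypothetical helper, not a pandas API.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    expected = len(next(reader))  # the header defines the expected width
    for row in reader:
        if not row:
            # blank line; pandas skips these by default as well
            continue
        if len(row) != expected:
            raise ValueError(
                f"line {reader.line_num}: expected {expected} fields, "
                f"got {len(row)}"
            )
    return expected
```

It would be called as `with open(fn, newline="") as f: check_field_counts(f)` before `pd.read_csv(fn)`, and the ValueError can then trigger the re-download.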

What do you expect to see for your example? – BENY
@Wen: Any exception I can raise to tell my program that it needs to restart the download. The same behaviour as in the first example (read_csv errors out because of too many commas) would be great. – WNG
Can your file contain NaN values? – MaxU
If you are just checking for an incomplete download, you could first count the number of fields in the last row only, before parsing as CSV. But even with a correct last row the download could be incomplete. – Danny_ds
@MaxU No, the file does not contain any NaN values, so this could indeed be a lead. – WNG

1 Answer

UPDATE:

the file does not contain any NaN values

In [85]: pd.read_csv(fn)
Out[85]:
  col1  col2  col3
0  foo     1   2.1
1  bar     2   3.1
2  zob     0   NaN

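Since the file itself contains no NaN values, any NaN in the parsed result must come from a short row, so you can raise an exception if the following condition is met: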

In [86]: pd.read_csv(fn).isnull().any().any()
Out[86]: True
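A minimal sketch of how that check could be wrapped so that it actually raises is shown below. The helper name read_csv_strict is hypothetical, not a pandas function, and the approach assumes, as stated in the comments, that a valid file never contains empty or NaN fields. A row such as "zob,0," with a deliberately empty last field would be rejected too.

```python
import io

import pandas as pd


def read_csv_strict(source, **kwargs):
    """Read a CSV and raise ValueError if any cell is missing.

    Hypothetical helper: it assumes a valid file has no empty/NaN fields,
    so every NaN after parsing indicates a short (truncated) row.
    """
    df = pd.read_csv(source, **kwargs)
    missing = df.isnull().any(axis=1)  # True for rows with at least one NaN
    if missing.any():
        bad_rows = missing[missing].index.tolist()
        raise ValueError(f"rows with missing values: {bad_rows}")
    return df


# Example with the data from the answer: the last row is one field short.
data = "col1,col2,col3\nfoo,1,2.1\nbar,2,3.1\nzob,0\n"
try:
    read_csv_strict(io.StringIO(data))
except ValueError as exc:
    print("download again:", exc)
```

The reported numbers are DataFrame index labels (0-based data rows), not line numbers in the file.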

Old answer:

Possible solution:

Consider the following input CSV file:

col1,col2,col3
foo,1,2.1
bar,2,3.1
zob,0

Reading it with an explicit dtype works, and the missing field becomes NaN:

In [50]: pd.read_csv(fn, dtype={'col3':'float'})
Out[50]:
  col1  col2  col3
0  foo     1   2.1
1  bar     2   3.1
2  zob     0   NaN

But if we instruct pandas not to treat empty strings as NaN, it throws an exception:

In [51]: pd.read_csv(fn, na_values=['NAN','NaN','#NA'], keep_default_na=False, dtype={'col3':'float'})
...
skipped
...
ValueError: could not convert string to float: