2
votes

I'm trying to import huge CSV files into a pandas DataFrame (200 columns and millions of lines).

I'm using the read_csv method, to which I pass a dtype dictionary as a parameter in order to speed up the import.

I get some exceptions about the wrong format I pass through dtype, like this:

ValueError: invalid literal for long() with base 10: ''

But there's no reference to the line number or to the column. My files are huge, so this information would save me a lot of time finding what's wrong in my dtype structure.

Any ideas?

Edit:

To be more precise, here's the whole story. First I tried to read my CSV file with this command:

t = pd.read_csv(filename, sep=",")

It gave me this error message:

C:\Python27\lib\site-packages\pandas\io\parsers.py:1159: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

So I tried to specify my dtype this way (I'm not copy/pasting the full dtype because there are 207 columns):

dtype_file = {
  'a': pd.np.int16,
  'b': pd.np.int16,
...
}
pd.read_csv(filename, sep=",", dtype=dtype_file, na_filter=False)
What parameters are you using with read_csv? – Brian from QuantRocket

2 Answers

2
votes

In fact, I resolved it myself using the low_memory parameter:

pd.read_csv(filename, sep=",", na_filter=False, low_memory=False)
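As a side note, if the underlying problem is empty fields in integer columns, newer pandas versions also offer nullable integer dtypes (note the capitalized "Int16"), which accept missing values where NumPy's int16 would fail. A minimal sketch with made-up inline data:

```python
import io

import pandas as pd

# Hypothetical two-column CSV with an empty field in column 'a'.
csv_data = "a,b\n1,2\n,4\n"

# The nullable "Int16" dtype tolerates the empty field (it becomes <NA>),
# where a plain numpy int16 dtype would raise on import.
df = pd.read_csv(io.StringIO(csv_data), dtype={"a": "Int16", "b": "Int16"})
print(df["a"].isna().tolist())  # prints [False, True]
```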
1
votes

You would get that error if you try to coerce an empty string to long:

In [366]: long("")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-366-65e3f7aa7bfe> in <module>()
----> 1 long("")

ValueError: invalid literal for long() with base 10: ''

So perhaps you have some empty strings in your numeric columns, which are causing the dtype coercion to fail.
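One way to locate the offending cells is to read everything as strings first, then probe each column with pd.to_numeric, which turns unparseable entries into NaN. A sketch using a hypothetical inline CSV (the sample data and column names are made up):

```python
import io

import pandas as pd

# Hypothetical CSV with an empty string where an integer is expected
# (row 1, column 'a').
csv_data = "a,b\n1,2\n,4\n5,6\n"

# Read everything as strings, keeping empty strings intact (na_filter=False).
df = pd.read_csv(io.StringIO(csv_data), dtype=str, na_filter=False)

for col in df.columns:
    # Unparseable cells (like '') become NaN under errors="coerce".
    converted = pd.to_numeric(df[col], errors="coerce")
    bad_rows = df.index[converted.isna()]
    if len(bad_rows):
        print("column {!r}: unparseable values at rows {}".format(col, list(bad_rows)))
```

This reports the exact column and row indices to check, which is the information missing from the original ValueError.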