2
votes

I would like to read multiple CSV files (hundreds of files, hundreds of lines each, but all with the same number of columns) from a target directory into a single Python Pandas DataFrame.

The code I wrote below works, but it is too slow: it takes minutes to load just 30 files (so how long would it take to load all of them?). What can I change to make it faster?

Besides, in the replace call I want to replace a "_" (I don't know its encoding, but it is not a normal underscore) with a normal UTF-8 "-". How can I do that? I use coding=latin-1 because the files contain French accents.

#coding=latin-1

import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0, sep=';', dayfirst=True,
                     parse_dates=['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective'])
    df.drop(labels=['aPaye','MethodePaiement','ArgentPercu'],axis=1,inplace=True)
    df['Sens'].replace("\n", "-", inplace=True,regex=True)
    list_.append(df)

    print "fichier lu:",file_

frame = pd.concat(list_)

print frame
Can you just concatenate all files together before you ever get to Python? That way just one file gets read? An alternative would be to zip the files together, which will give you random access but with the benefits of the operating system's file cache. - chrisaycock
I don't really understand how to concatenate before getting to Python. Do you mean make all the files into one CSV file? And for the zip, at which step? - ch36r5s
Yes, make one big CSV file or make a zip file. These have nothing to do with Python; you'll have to do this externally to Python. Once you have one file with all of the data, then have your Python script read that file directly. - chrisaycock
It sounds like a solution. But how do I do that automatically? And how do I make a zip file? I'd still prefer to do it inside Python, since I don't know exactly how many files I have to read. - ch36r5s
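The pre-concatenation the comments suggest can also be done from Python itself with plain file I/O, which sidesteps any external tooling. A minimal sketch, assuming all files share the same header row (the file names and directory here are made up for the demo):

```python
import glob
import os
import tempfile

# Demo setup: build two small CSVs with identical headers
tmp = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmp, 'part%d.csv' % i), 'w') as f:
        f.write('Sens;HeurePrevue\n')        # shared header
        f.write('Aller;08:0%d\n' % i)        # one data row each

# Merge: copy the header from the first file only,
# then append the data rows of every file
out_path = os.path.join(tmp, 'combined.csv')
with open(out_path, 'w') as out:
    for j, name in enumerate(sorted(glob.glob(os.path.join(tmp, 'part*.csv')))):
        with open(name) as f:
            header = f.readline()
            if j == 0:
                out.write(header)            # keep the header once
            out.writelines(f)                # copy the remaining rows
```

The resulting combined.csv can then be handed to a single pd.read_csv call.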

1 Answer

2
votes

you may try the following - read only the columns you really need, use a list comprehension, and call pd.concat([ ... ], ignore_index=True) only once, because repeated concatenation is pretty slow:

# there is no sense to read columns that you don't need
# specify the column list (EXCLUDING: 'aPaye','MethodePaiement','ArgentPercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective']

df = pd.concat(
        [pd.read_csv(f, sep = ';', dayfirst = True, usecols=cols, 
                     parse_dates=date_cols)
         for f in allFiles
        ],
        ignore_index=True
     )

this should work if you have enough memory to hold both the list of per-file DFs and the resulting concatenated DF at the same time...
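As for the odd "_" sub-question: the reliable approach is to first inspect the character's actual codepoint with repr(), then replace that exact character. A hedged sketch (U+2010 is used here only as a stand-in, since the real codepoint in your files isn't known; the sample data is made up):

```python
import pandas as pd

# Hypothetical sample: a 'Sens' column containing a non-ASCII
# dash-like character (U+2010 stands in for the odd "_")
df = pd.DataFrame({'Sens': [u'Aller\u2010Retour', u'Aller']})

# Step 1: reveal the character's escape so you know what to target
# print(repr(df['Sens'].iloc[0]))

# Step 2: replace that exact character with a plain ASCII hyphen
df['Sens'] = df['Sens'].str.replace(u'\u2010', '-')
```

Once you know the real codepoint from repr(), substitute it for u'\u2010' above.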