I am trying to merge relatively big CSV files in pandas.
I've got a folder with 236 files totalling 246 MB, and I am trying to join them into one big CSV file, with each file's columns renamed to filename-close / filename-volume and everything indexed on a Unix timestamp.
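For example, for two hypothetical input files BTCUSD.csv and ETHUSD.csv (made-up names, just to illustrate the layout), the merged file should look like:

time,BTCUSD-close,BTCUSD-volume,ETHUSD-close,ETHUSD-volume
1596549600,22.0,0.5,...,...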
I expect that there are some holes in my data, so I have to use an outer join to merge them, but the following script only makes it through 18 CSV files before it consumes 26 GB of RAM and terminates with the error message:
MemoryError: Unable to allocate 1.52 GiB for an array with shape (2, 101966219) and data type float64
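(That allocation checks out: 2 columns × 101,966,219 rows of float64 is about 1.52 GiB, which suggests the intermediate frame has already ballooned to over 100 million rows even though the inputs are tiny.)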
Here's the complete code:
import os
import pandas as pd
import math

ppFiles = os.listdir("./preprocessed")
fullDf = None
for f in ppFiles:
    df = pd.read_csv('./preprocessed/' + f)
    df = df.set_index('time')
    # the pair name is the filename without its extension
    pair = f.split('.')[0]
    print("currently adding pair " + pair)
    df[pair + '-close'] = df['close']
    df[pair + '-volume'] = df['volume']
    df = df.drop(columns=["close", "volume"])
    #print(df.head())
    if fullDf is None:
        fullDf = df
    else:
        # outer-join the new pair onto everything merged so far
        fullDf = fullDf.join(df, how="outer")
fullDf.to_csv("./preprocessed/0AllTogether.csv")
Is there any way I can achieve this with pandas and a maximum of ~27 GB of RAM, or do I have to write my own text-based concatenation? I had hoped that 32 GB of system memory would be enough to concatenate <250 MB of files in memory...
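For example, I've been wondering whether collecting all the frames into a list first and calling pd.concat once (instead of 235 incremental outer joins) would behave any better memory-wise. Here's a rough, untested sketch of what I mean, using the same folder and column names as above:

import os
import pandas as pd

frames = []
for f in os.listdir("./preprocessed"):
    pair = f.split('.')[0]
    # read each file with the timestamp column as the index
    df = pd.read_csv('./preprocessed/' + f, index_col='time')
    # rename the columns instead of copying and dropping them
    df = df.rename(columns={'close': pair + '-close', 'volume': pair + '-volume'})
    frames.append(df)

# align everything on the shared time index in one outer concat
fullDf = pd.concat(frames, axis=1, join='outer')
fullDf.to_csv("./preprocessed/0AllTogether.csv")

I'm not sure whether that actually changes the memory profile, though, or whether pandas is simply the wrong tool here.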
Here are the first couple of lines of my data; all CSV files are formatted like this:
time,close,volume
1596549600,22.0,0.5
1596550500,22.0,0.0
1596551400,22.0,0.0
1596552300,23.0,28.0
1596553200,23.0,0.0
1596554100,40.0,206.79499996
1596555000,40.0,30.0
1596555900,24.4,10.033244
1596556800,20.99,24.1
1596557700,22.0,117.9426
1596558600,25.0,6.8823
1596559500,20.0,0.0
1596560400,21.9,7.25
1596561300,22.0,14.85497737
1596562200,22.0,0.0
1596563100,19.0,28.59515365
1596564000,19.0,11.16297118
1596564900,19.0,10.62149454
1596565800,19.0,0.0
1596566700,24.0,125.1021233