0 votes

I am trying to merge relatively big CSV files in pandas. I have a folder with 236 files totalling 246 MB, and I want to join them into one big CSV file, with each file's columns renamed to filename-close / filename-volume and everything indexed on a Unix timestamp. I expect there are some holes in my data, so I have to use an outer join to merge them, but the following script only makes it through 18 CSV files before it consumes 26 GB of RAM and terminates with the error message:

MemoryError: Unable to allocate 1.52 GiB for an array with shape (2, 101966219) and data type float64

Here's the complete code:

import os
import pandas as pd

ppFiles = os.listdir("./preprocessed")
fullDf = None
for f in ppFiles:
    df = pd.read_csv('./preprocessed/' + f)
    df = df.set_index('time')
    pair = f.split('.')[0]
    print("currently adding pair " + pair)
    # rename the columns to <filename>-close / <filename>-volume
    df[pair + '-close'] = df['close']
    df[pair + '-volume'] = df['volume']
    df = df.drop(columns=["close", "volume"])
    #print(df.head())
    # outer join onto the combined frame so rows with missing timestamps are kept
    if fullDf is None:
        fullDf = df
    else:
        fullDf = fullDf.join(df, how="outer")
fullDf.to_csv("./preprocessed/0AllTogether.csv")

Is there any way to achieve this with pandas and a maximum of ~27 GB of RAM, or do I have to write my own text-based concatenation? I had hoped that 32 GB of system memory would be enough to concatenate <250 MB of files in memory...

Here are the first couple of lines of my data; all CSV files are formatted like this:

time,close,volume
1596549600,22.0,0.5
1596550500,22.0,0.0
1596551400,22.0,0.0
1596552300,23.0,28.0
1596553200,23.0,0.0
1596554100,40.0,206.79499996
1596555000,40.0,30.0
1596555900,24.4,10.033244
1596556800,20.99,24.1
1596557700,22.0,117.9426
1596558600,25.0,6.8823
1596559500,20.0,0.0
1596560400,21.9,7.25
1596561300,22.0,14.85497737
1596562200,22.0,0.0
1596563100,19.0,28.59515365
1596564000,19.0,11.16297118
1596564900,19.0,10.62149454
1596565800,19.0,0.0
1596566700,24.0,125.1021233
Joins can be memory intensive. – juanpa.arrivillaga

What do you mean by holes in your data? – juanpa.arrivillaga

I mean that there should be 15 minutes (900 s) between data points, but I haven't analysed the data yet (which I want to do after this step, so I can analyse everything at once instead of working through 236 separate files). I am fairly certain that over the years of collecting this dataset there have been some mistakes, leading to data points that were never recorded. – Taxel

1 Answer

1 vote

You can do a few things to save memory (see the sketch below):

  • Specify the column datatypes in pd.read_csv() via the dtype parameter
  • Downcast float64 columns to float32
  • Downcast int64 columns to int32 (or smaller)
  • Convert object columns to category where possible
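Here is a minimal sketch of how these ideas could be applied to the loop from the question, assuming the time/close/volume layout shown in the sample data. The float32 choice is an assumption on my part, so check that the reduced precision is acceptable for your prices and volumes:

import os
import pandas as pd

# assumed dtypes based on the sample data; halving float precision roughly
# halves the memory per column
dtypes = {'time': 'int64', 'close': 'float32', 'volume': 'float32'}

ppFiles = os.listdir("./preprocessed")
fullDf = None
for f in ppFiles:
    pair = f.split('.')[0]
    # declare dtypes up front so pandas never allocates float64 columns
    df = pd.read_csv('./preprocessed/' + f, dtype=dtypes)
    df = df.set_index('time')
    df = df.rename(columns={'close': pair + '-close', 'volume': pair + '-volume'})
    if fullDf is None:
        fullDf = df
    else:
        fullDf = fullDf.join(df, how="outer")
fullDf.to_csv("./preprocessed/0AllTogether.csv")

For columns whose type you don't know in advance, pd.to_numeric(series, downcast='float') / pd.to_numeric(series, downcast='integer') and series.astype('category') cover the downcasting and category points after the data is loaded.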