The easy answer to this is "buy more RAM" but I am hoping to get a more constructive answer and learn something in the process.
I am running Windows 7 64-bit with 8GB of RAM.
I have several very large .csv.gz files (~450MB uncompressed) with the same exact header information that I read into R and perform some processing on. Then, I need to combine the processed R objects into a single master object and write back out to .csv on disk.
I do this same operation on multiple sets of files. As an example, I have 5 folders each with 6 csv.gz files in them. I need to end up with 5 master files, one for each folder.
My code looks something like the following:
for( loop through folders ){
master.file = data.table()
for ( loop through files ) {
filename = list.files( ... )
file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
gc()
...do some processing to file...
# append file to the running master.file
if ( nrow(master.file) == 0 ) {
master.file = file
} else {
master.file = rbindlist( list( master.file, file) )
}
rm( file, filename )
gc()
}
write.csv( master.file, unique master filename, row.names = FALSE )
rm( master.file )
gc()
}
This code does not work. I get the cannot allocate memory
error before it writes out the final csv. I was watching resource monitor while running this code and don't understand why it would be using 8GB of RAM to do this processing. The total of all the file sizes is roughly 2.7GB, so I was expecting that the maximum memory R would use is 2.7GB. But the write.csv operation seems to use the same amount of memory as the data object you are writing, so if you have a 2.7GB object in memory and try to write it out, you would be using 5.6 GB of memory.
This apparent reality, combined with using a for
loop in which memory doesn't seem to be getting adequately freed up seems to be the problem.
I suspect that I could use the sqldf
package as mentioned here and here but when I set the sqldf
statement equal to an R variable I ended up with the same out of memory errors.
write.table(yourFirstDataTable, sep = ",", file = YourFile.csv)
for the first file read in and processed, and thenwrite.table(yourOtherDataTables, sep = ",", file = YourFile.csv, append = TRUE, col.names = FALSE)
? – A5C1D2H2I1M1N2O1R2T1