4 votes

I have a 9-column data.frame (x) with millions of rows. I was able to read it into R and successfully do some modifications on it, and that code executes without a problem. However, when I try to write it out to a .csv file using

write.csv(x,file=argv[2],quote=F,row.names=F)

I get an error which says

Error: cannot allocate vector of size 1.2Gb

This makes no sense, as the data is already in memory and the computations are done; all I want to do is write it out to disk. Also, while monitoring memory, I saw the virtual memory size of the process almost double during this write phase. Would writing a custom C function to write out this data.frame help? Any suggestions/help/pointers appreciated.

PS: I'm running all this on a 64-bit Ubuntu box with about 24GB of RAM, so overall memory should not be an issue. The data is about 10GB in size.

The simplest thing to do is to write it to file in small pieces using append = TRUE. As an aside, the total RAM installed on your machine can be a misleading indicator of whether you'll have memory issues, as R frequently needs contiguous blocks of memory of a particular size. Even with 24GB, finding 10 contiguous GB of memory might be a challenge at times. – joran
What @joran said. You could try gc() immediately beforehand, but it's unlikely to help much. – Ari B. Friedman
You can try saving the object as an .RData image and loading it in a new session; for some reason it consumes less memory than an object fresh from a computation. I use this trick sometimes when I experience memory problems (see the sketch below). – sus_mlm
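
A minimal sketch of that save-and-reload trick; the file names and the final write.csv() call are illustrative rather than taken from the question:

## In the session that produced x: save the finished data.frame to disk
save(x, file = "x.RData")

## --- then, in a fresh R session with a clean heap ---
load("x.RData")                                   # restores the object 'x'
write.csv(x, file = "out.csv", quote = FALSE, row.names = FALSE)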

1 Answer

9 votes

You have to understand that R functions will often copy their arguments if they modify them: the functional programming paradigm employed by R decrees that functions don't change the objects passed in as arguments, so R copies them when changes need to be made in the course of executing a function.
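
As a small, hedged illustration of that copy-on-modify behaviour (the function f and data frame d below are invented for the example):

f <- function(df) {
  df$flag <- TRUE   # modifying the argument forces R to copy it
  df                # the modified copy is returned
}

d <- data.frame(a = 1:3)
f(d)   # returns the copy with the extra column
d      # still has only column 'a' -- the caller's object is untouched

The same mechanism drives the copies you can observe below with tracemem().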

If you build R with memory-tracing support, you can see this copying in action for any operation you are having trouble with. Using the airquality example data set and tracing memory use, I see

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> tracemem(airquality)
[1] "<0x12b4f78>"
> write.csv(airquality, "airquality.csv")
tracemem[0x12b4f78 -> 0x1aac0d8]: as.list.data.frame as.list lapply unlist which write.table eval eval eval.parent write.csv 
tracemem[0x12b4f78 -> 0x1aabf20]: as.list.data.frame as.list lapply sapply write.table eval eval eval.parent write.csv 
tracemem[0x12b4f78 -> 0xf8ae08]: as.list.data.frame as.list lapply write.table eval eval eval.parent write.csv 
tracemem[0x12b4f78 -> 0xf8aca8]: write.table eval eval eval.parent write.csv 
tracemem[0xf8aca8 -> 0xca7fe0]: [<-.data.frame [<- write.table eval eval eval.parent write.csv 
tracemem[0xca7fe0 -> 0xcaac50]: [<-.data.frame [<- write.table eval eval eval.parent write.csv

So that indicates 6 copies of the data are being made as R prepares it for writing to file.

Clearly that copying is eating into the 24GB of RAM you have available; the error says that R needs a further 1.2Gb of RAM to complete an operation.

The simplest solution to start with would be to write the file out in chunks. Write the first chunk of rows with append = FALSE and a header, then use append = TRUE for the calls writing out the remaining chunks. Note that write.csv() ignores an explicit append argument, so use write.table() with sep = "," for the chunked writes. You may need to play around with this to find a chunk size that will not exceed the available memory.
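
A rough sketch of that chunked approach, assuming x is your data.frame; chunk_size and the output file name are placeholders you would tune to your data and memory budget:

chunk_size <- 1e6                          # rows per chunk -- tune as needed
starts     <- seq(1, nrow(x), by = chunk_size)

for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(x))
  write.table(x[rows, , drop = FALSE], file = "out.csv", sep = ",",
              quote = FALSE, row.names = FALSE,
              col.names = (i == 1),        # write the header only once
              append    = (i > 1))         # append every chunk after the first
}

Because each call only touches a slice of the data.frame, any copies R makes along the way are correspondingly smaller.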