11 votes

I have a big data frame taking up about 900 MB of RAM. Then I tried to modify it like this:

dataframe[[17]][37544]=0 

It seems this makes R use more than 3 GB of RAM, and R complains "Error: cannot allocate vector of size 3.0 Mb" (I am on a 32-bit machine).

I found that this way is better:

dataframe[37544, 17]=0

but R's memory footprint still doubled and the command took quite some time to run.

Coming from a C/C++ background, I am really confused by this behavior. I thought something like dataframe[37544, 17]=0 should complete in a blink without costing any extra memory (only one cell should be modified). What is R doing for the commands I posted? And what is the right way to modify some elements of a data frame without doubling the memory footprint?

Thanks so much for your help!

Tao

(Base) R is famously not spectacular at handling large data structures. You'll want to look into some combination of the ff, bigmemory and data.table packages. – joran
That's not really true - data.frames are famously inefficient, but there are very efficient structures in (base) R that you should use instead if you care about efficiency. – Simon Urbanek
@SimonUrbanek Yes, I phrased that badly. I just meant exactly what you said: that data frames tend to be inefficient, and that the packages I mentioned can often be useful for folks dealing with large data. – joran
@Simon Like what? matrix is restricted to a common column type, and can't be as large as a data.frame. Are you suggesting list perhaps? – Matt Dowle

4 Answers

8 votes

Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
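
To see this copy happening for yourself, base R's tracemem() is handy (a minimal sketch on a made-up data frame; standard CRAN builds are compiled with the memory profiling it needs):

df <- data.frame(a = runif(1e6), b = runif(1e6))
tracemem(df)       # mark df; R prints a message each time it gets duplicated
df[1, 2] <- 0      # even a single-cell replacement triggers duplication messages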

A useful rule of thumb is that if your largest object is N mb/gb/... large, you need around 3*N of RAM. Such is life with an interpreted system.

Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguous allocation being the other thing R likes).
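
As a rough illustration of that approach (a sketch only: a big.matrix holds a single storage type, so it assumes the data frame can be coerced to a numeric matrix):

library(bigmemory)
bm <- as.big.matrix(as.matrix(dataframe), type = "double")   # data kept outside R's heap via an external pointer
bm[37544, 17] <- 0                                           # element assignment updates that external memory in place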

12 votes

Following up on Joran's suggestion of data.table, here are some links. Your object, at 900 MB, is manageable in RAM even in 32-bit R, with no copies at all.

When should I use the := operator in data.table?

Why has data.table defined := rather than overloading <-?

Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to a matrix (appropriate for use inside loops, for example). See the latest NEWS for more details and an example. Also see ?":=", which is linked from ?data.table.

And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".

For completeness :

require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :

DT[37544, Q:=0]                # using column name (often preferred)

DT[37544, 17:=0, with=FALSE]   # using column number

col = "Q"
DT[37544, col:=0, with=FALSE]  # variable holding name

col = 17
DT[37544, col:=0, with=FALSE]  # variable holding number

set(DT,37544L,17L,0)           # using set(i,j,value) in v1.8.0
set(DT,37544L,"Q",0)

But please do see the linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join, as sketched below.
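
For instance, a hedged sketch of that last point, assuming a hypothetical key column 'id' and key value "x42":

setkey(DT, id)          # sort DT by 'id' once and mark it as the key
DT[J("x42"), Q := 0]    # matching rows are located by binary search, then updated by reference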

8 votes

Data frames are the worst structure you can choose to modify. Because the quite complex handling of all their features (such as keeping row names in sync, partial matching, etc.) is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this - it has been discussed at length several times.

The practical rule is to never use data frames for large data unless you treat them as read-only. You will be orders of magnitude more efficient if you work on vectors or matrices instead.
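
For example, a minimal sketch of the matrix route, assuming the 900 MB data frame is (or can be reduced to) all-numeric columns:

m <- as.matrix(dataframe)   # one-time conversion (this step does copy)
m[37544, 17] <- 0           # single-cell assignment on a plain matrix with one reference is done in place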

4 votes

There is a type of object called ffdf in the ff package, which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.
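
A minimal sketch, assuming the ff package is installed and reusing the hypothetical column name 'Q' from the data.table answer above:

library(ff)
fdf <- as.ffdf(dataframe)   # one-time conversion; the data now lives in files on disk
x <- fdf$Q                  # an ff vector backed by the on-disk file
x[37544] <- 0               # element assignment writes through to disk rather than an in-RAM copy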

You can also try the RSQLite package.
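
A sketch of the RSQLite route, assuming a recent DBI/RSQLite and the hypothetical table and column names below:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "data.sqlite")
dbWriteTable(con, "df", dataframe)                            # load the data frame into SQLite once
dbExecute(con, "UPDATE df SET Q = 0 WHERE rowid = 37544")     # modify a single cell on disk
dbDisconnect(con)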