
I am working with a dataset of about 17M x 4 values on a 32-bit Windows machine. This requires ~700 MB in GNU R, so when I try to do some more involved operations, the 2 GB limit is easily reached and I receive an out-of-memory error (cannot allocate vector ...).

No problem - there is the package "ff" to store such data on disk. However, my first subset runs into the same error. According to the ff documentation I expected that "[" would subset directly into another ffdf without loading two copies of the data into memory. Where exactly am I going wrong?

ffshares = read.table.ffdf(
  file=tmpFilename, header = FALSE, sep = ",", quote = "\"",
  dec = ".",
  col.names = c("articleID", "measure", "time", "value"),
  na.strings = c("","-1","\\N"),
  colClasses = c("integer","factor","POSIXct","integer"),
  check.names = TRUE, fill = TRUE,
  strip.white = FALSE, blank.lines.skip = TRUE,
  comment.char = "",
  allowEscapes = FALSE, flush = FALSE #, nrow=1000
)
# Up to this point, the R process requires about 200 MB

ffshares = ffshares[ffshares[,"articleID"] %in% articles[,"articleID"],]
# As soon as I try this, memory consumption exceeds 1.7 GB and the available limit

Note: articles is a data frame with ~30K rows. articleID is a simple integer.

Bonus question: ffshares[,"articleID"] works, but ffshares$articleID does not. According to the documentation, the dollar operator ($) should work like on a data frame?!

Thanks for any advice :)

What does ffshares[,"articleID"] give you? It is a plain vector, not an ff vector. This means you have extracted your data from the ff object into RAM. To avoid this, you can use the subset.ffdf function from the package ffbase. – user1600826
Okay - it's a good trick to hide subset.ffdf in another package than "ff" :) I did not expect the subset function to work differently from [. Importing 'ffbase' and using subset.ffdf() instead of [ indeed solves the memory issue. That's the solution. Wanna post it as an answer? – BurninLeo
Glad to have helped you out. Feel free to post the answer yourself if it will help other people. – user1600826
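
For reference, here is a minimal sketch of the fix discussed in the comments (this assumes ffbase is installed; the exact subset expression may need adjusting for your data):

library(ff)
library(ffbase)

# subset.ffdf filters on disk and returns another ffdf,
# so the full column is never pulled into RAM as a plain vector
ffshares_sub <- subset(ffshares, articleID %in% articles$articleID)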

1 Answer


I apologise for this not being a direct answer to your question, but I have used the bigmemory package for similarly sized objects, in particular the filebacked.big.matrix function and the operators optimised for large matrices. You might find it useful if you don't get a direct answer to your question.
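
A minimal sketch of that approach (note that a big.matrix holds a single element type, so mixed-type columns such as factors or timestamps would need to be encoded as numbers first; the file names below are just examples):

library(bigmemory)

# A file-backed matrix is stored on disk, so it does not count
# against the 2 GB per-process limit on 32-bit R
fbm <- filebacked.big.matrix(
  nrow = 17e6, ncol = 4, type = "integer",
  backingfile = "shares.bin",
  descriptorfile = "shares.desc"
)

# Reattach later (or from another R session) without reloading the data
fbm2 <- attach.big.matrix("shares.desc")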