I am working with a dataset of about 17M x 4 values on a 32-bit Windows machine. It requires ~700 MB in GNU R, so when I try anything more involved, the 2 GB limit is quickly reached and I get an out-of-memory error (cannot allocate vector ...).
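Back-of-the-envelope (assuming roughly 8 bytes per value, before any copies), that size seems about right:

17e6 * 4 * 8 / 1024^2   # ~519 MB raw, so ~700 MB with R's overhead is plausible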
No problem - there is the "ff" package for storing such data on disk. However, my very first subset runs into the same error. According to the ff documentation I expected "[" to subset directly into another ffdf without loading two copies of the data into memory. Where exactly am I going wrong?
ffshares = read.table.ffdf(
file=tmpFilename, header = FALSE, sep = ",", quote = "\"",
dec = ".",
col.names = c("articleID", "measure", "time", "value"),
na.strings = c("","-1","\\N"),
colClasses = c("integer","factor","POSIXct","integer"),
check.names = TRUE, fill = TRUE,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "",
allowEscapes = FALSE, flush = FALSE #, nrow=1000
)
# Until here, the R process requires about 200M
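# For reference, I watch the footprint with base R's Windows-only helpers:
memory.size()    # current allocation in MB
memory.limit()   # maximum allocation in MB, i.e. the ~2 GB cap from above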
ffshares = ffshares[ffshares[,"articleID"] %in% articles[,"articleID"],]
# As soon as I run this, memory consumption exceeds 1.7 GB and blows past the available limit
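If I unroll that one-liner, my understanding is (idx and sub are just names for illustration):

idx = ffshares[, "articleID"] %in% articles[, "articleID"]  # a 17M-element logical vector, held in RAM
sub = ffshares[idx, ]  # and the row subset seems to come back as a plain in-RAM data.frame, not an ffdf

so nothing in that expression actually stays on disk, as far as I can tell.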
Note: articles is a data.frame with ~30K rows; articleID is a plain integer column.
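For reproduction purposes, a stand-in with the same shape would be (made-up values, reduced to the one relevant column):

articles = data.frame(articleID = sample.int(17e6, 30000))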
Bonus question: ffshares[,"articleID"] works, but ffshares$articleID does not. According to the documentation, the dollar operator ($) should work just like on a data.frame?!
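In other words:

ffshares[, "articleID"]   # returns the column as expected
ffshares$articleID        # does not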
Thanks for any advice :)