I am trying to process more than 10,000 xts objects saved on disk, each around 0.2 GB when loaded into R. I would like to use foreach to process them in parallel. My code works for something like 100 xts objects that I pre-load into memory, export, etc., but beyond roughly 100 objects I hit memory limits on my machine.
Example of what I am trying to do:
require(xts)
require(TTR)
require(doMPI)
require(foreach)
test.data <- runif(n=250*10*60*24)
xts.1 <- xts(test.data, order.by=as.Date(1:length(test.data), origin="1970-01-01"))
xts.1 <- cbind(xts.1, xts.1, xts.1, xts.1, xts.1, xts.1)
colnames(xts.1) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
print(object.size(xts.1), units="Gb")
xts.2 <- xts.1
xts.3 <- xts.1
xts.4 <- xts.1
save(xts.1, file="xts.1.rda")
save(xts.2, file="xts.2.rda")
save(xts.3, file="xts.3.rda")
save(xts.4, file="xts.4.rda")
names <- c("xts.1", "xts.2", "xts.3", "xts.4")
rm(xts.1, xts.2, xts.3, xts.4)
cl <- startMPIcluster(count=2) # Use 2 cores
registerDoMPI(cl)
result <- foreach(name=names,
                  .combine=cbind,
                  .multicombine=TRUE,
                  .inorder=FALSE,
                  .packages=c("TTR", "xts")) %dopar% {
  # TODO: Move the following line out of the worker. One (or 5, 10,
  # 20, ... but not all) object at a time should be loaded
  # by the master and exported to the worker "just in time"
  load(file=paste0(name, ".rda"))
  last(SMA(get(name)[, 1], 10))
}
closeCluster(cl)
print(result)
So I am wondering how I could load each xts object (or several at a time, e.g. 5, 10, 20, 100, but not all at once) from disk "just in time", right before it is needed by/exported to a worker. I cannot load the objects in the workers (based on the name and the folder where they are stored on disk), because the workers may run on remote machines that have no access to that folder. So I need to read/load the objects "just in time" in the main process...
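To make this concrete, here is a minimal sketch of the kind of "lazy" iterator I have in mind, following the custom-iterator pattern of the iterators package (ixts is a hypothetical helper I have not tested at scale; it assumes the <name>.rda files created above):

require(iterators)

ixts <- function(names) {
  i <- 0
  nextEl <- function() {
    i <<- i + 1
    if (i > length(names)) stop("StopIteration")
    # Load one object from disk in the master only when the next task
    # is about to be handed out, and return the object itself
    e <- new.env()
    load(file=paste0(names[i], ".rda"), envir=e)
    get(names[i], envir=e)
  }
  it <- list(nextElem=nextEl)
  class(it) <- c("abstractiter", "iter")
  it
}

result <- foreach(x=ixts(names),
                  .combine=cbind,
                  .multicombine=TRUE,
                  .inorder=FALSE,
                  .packages=c("TTR", "xts")) %dopar% {
  last(SMA(x[, 1], 10))
}

With this the workers receive the data itself rather than a file name, so they do not need access to the folder on disk. Whether the master really keeps only a handful of objects in memory at any moment presumably depends on how the back-end prefetches tasks (e.g. doMPI's chunkSize option), which is part of what I am unsure about; loading blocks of 5, 10, 20, ... objects per task should work the same way by having the iterator return a list of objects per call.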
I am using doMPI and doRedis as parallel back-ends. doMPI seems more memory-efficient but slower than doRedis (tested on 100 objects).
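For reference, the doRedis side of that comparison is set up roughly like this (only a sketch; the queue name "jobs" is arbitrary and a Redis server is assumed to be running on localhost):

require(doRedis)
registerDoRedis("jobs")                # register the Redis-backed foreach back-end
startLocalWorkers(n=2, queue="jobs")   # workers could equally be started on remote machines
# ... run the same foreach() loop as above ...
removeQueue("jobs")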
So I would like to understand what a proper "strategy"/"pattern" for approaching this problem would be.