This might be one for the philosophers... (or @Steve Weston or @Martin Morgan)
I've been having memory issues when using parLapply, and after digging through enough threads on the matter, I think this question is warranted. I've spent some time trying to figure it out, and while I have an inkling of why the observed behavior happens, I'm lost as to how to resolve it.
Consider the following as a sourced script, saved as: parallel_question.R
rf.parallel <- function(n = 10) {
  library(parallel)
  library(randomForest)
  rf.form <- as.formula(paste("Final", paste(c("x", "y", "z"), collapse = "+"), sep = " ~ "))
  rf.df <- data.frame(Final = runif(10000), y = runif(10000), x = runif(10000), z = runif(10000))
  # Split the data into n roughly equal chunks, one per worker
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  # Fit one forest per chunk on the cluster
  rf.list <- parLapply(cl, rf.df.list, function(x, rf.form) {
    randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10, norm.votes = FALSE)
  }, rf.form)
  stopCluster(cl)
  return(rf.list)
}
We source and run the script with:
script.loc <- "G:\\Scripts_Library\\R\\Stack_Answers\\parallel_question.R"
source(script.loc)
rf.parallel(n = 10)
Fairly straightforward... we fit several random forests in parallel, and it appears memory efficient. We could combine the forests later, or do something else with them. Handy. Nice. Well behaved.
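As a sanity check (a sketch only; it would have to run inside rf.parallel while the cluster is still alive, i.e. before stopCluster), you can ask each worker how much memory it is holding:

# Column 2 of the matrix gc() returns is current usage in Mb
# (one row each for Ncells and Vcells), so this reports
# per-worker memory use:
clusterEvalQ(cl, sum(gc()[, 2]))

With the first script, each worker should report only a modest footprint.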
Now consider the following script, saved as parallel_question_2.R
rf.parallel_2 <- function(n = 10) {
  library(parallel)
  library(magrittr)
  library(randomForest)
  rf.form <- as.formula(paste("Final", paste(c("x", "y", "z"), collapse = "+"), sep = " ~ "))
  rf.df <- data.frame(Final = runif(10000), y = runif(10000), x = runif(10000), z = runif(10000))
  # ~3 GB object that the parallel call never references
  large.list <- rep(rf.df, 10000)
  rf.df.list <- split(rf.df, rep(1:n, nrow(rf.df))[1:nrow(rf.df)])
  cl <- makeCluster(n)
  rf.list <- parLapply(cl, rf.df.list, function(x, rf.form) {
    randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10, norm.votes = FALSE)
  }, rf.form)
  stopCluster(cl)
  return(rf.list)
}
In this second script, we've got a large list sitting in the function's environment. We never reference the list in the parallel call or pass it to the worker function. I've sized it (~3 GB) so it will probably be a problem on at least a 32 GB machine.
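To confirm that size claim, a quick check (sketch; it would run inside rf.parallel_2 right after large.list is created):

# rep() on a data.frame returns a plain list of 40,000 numeric
# vectors of length 10,000, i.e. roughly 40,000 * 10,000 * 8 bytes:
print(format(object.size(large.list), units = "Gb"))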
script.loc <- "G:\\Scripts_Library\\R\\Stack_Answers\\parallel_question_2.R"
source(script.loc)
rf.parallel_2(n = 10)
When we run the second script, we end up carrying around an extra ~3 GB (the size of our large list) for each worker process in the cluster. If we instead run the contents of the second script in a non-sourced environment (pasted straight into the console rather than run via source()), this does not happen; we get one ~3 GB list, the parallelized function runs without issue, and that's the end of it.
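My working hypothesis (unverified) is that parLapply serializes the worker function together with its enclosing environment, which here is rf.parallel_2's evaluation frame, large.list included. If so, the payload is visible directly; this sketch names the anonymous function worker for illustration and would run inside rf.parallel_2 just before parLapply:

worker <- function(x, rf.form) {
  randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10, norm.votes = FALSE)
}
# serialize() captures the closure plus its enclosing frame; since
# that frame holds large.list, expect a multi-gigabyte raw vector:
print(length(serialize(worker, NULL)))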
So... how/why are the worker environments picking up unnecessary objects from the parent environment? Why does it only happen with sourced scripts? And how can I mitigate this when I have a large, complex sourced script with parallelized sub-sections that may have 3-10 GB of intermediate data hanging around?
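For what it's worth, one workaround I've been considering (a sketch only, assuming closure serialization really is the cause) is to detach the worker function from the calling frame before handing it to parLapply:

worker <- function(x, rf.form) {
  randomForest::randomForest(rf.form, x, ntree = 100, nodesize = 10, norm.votes = FALSE)
}
# The global environment is sent to workers by reference, not by
# value, so re-parenting the closure should keep large.list at home:
environment(worker) <- globalenv()
rf.list <- parLapply(cl, rf.df.list, worker, rf.form)

But I'd like to understand whether that is the right fix, or whether something about source() itself is to blame.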
Relevant or similar threads: