
I am running a process in parallel using the doParallel/Foreach backend in R. I'm registering a set of 20 cores as a cluster, and running the process about 100 times. I'm passing a matrix to each iteration of the parallel processes, and in the sub-process I replace the matrix with a random sample of its own rows. What I'm wondering is: should I expect that this modification persists for subsequent iterations handled by the same child process? E.g., when child process 1 finishes its first iteration, does it start the second iteration with the original matrix, or the random sample?

A minimal example:

   library(doParallel)

   X <- matrix(1:400, ncol=4)

   cl <- makeCluster(2)
   clusterExport(cl, "X")
   registerDoParallel(cl)


   results<-foreach(i=1:100) %dopar% {
       set.seed(12345)
       X <- X[sample.int(nrow(X),replace=TRUE),]
       X
   }

EDIT:

To be clear, if indeed the object will persist across iterations by the same worker process, this is not my desired behavior. Rather, I want to have each iteration take a fresh random sample of the original matrix, not a random sample of the most recent random sample (I recognize that in my minimal example it would moreover create the same random sample of the original matrix each time, due to the seed set--in my actual application I deal with this).

1 Answer


Side effects within a cluster worker that persist across iterations of a foreach loop are possible, but that is not a supported feature of foreach. Programs that take advantage of it probably won't be portable to different parallel backends, and may not work with newer versions of the software. In fact, I tried to make that kind of side effect impossible when I first wrote foreach, but I eventually gave up.

Note that in your case, you're not modifying the copy of X that was explicitly exported to the workers: you're modifying a copy that was auto-exported to the workers by doParallel. That has probably been a source of confusion to you.
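One way to check this for yourself is to compare the explicitly exported copy on each worker with the original after the loop has run. This is only a sketch, assuming a PSOCK cluster registered through doParallel; the names worker_copies and unchanged are made up for illustration:

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

X <- matrix(1:400, ncol = 4)
clusterExport(cl, "X")   # explicit copy in each worker's global environment

results <- foreach(i = 1:10) %dopar% {
    # Plain `<-` assigns in the environment where the loop body is evaluated,
    # not in the worker's global environment.
    X <- X[sample.int(nrow(X), replace = TRUE), ]
    X
}

# Fetch the explicitly exported copy from each worker's global environment
# and compare it with the original matrix.
worker_copies <- clusterEvalQ(cl, X)
unchanged <- vapply(worker_copies, identical, logical(1), X)
print(unchanged)

stopCluster(cl)
```

If the description above holds, every comparison should come back TRUE, because the loop body only ever touched the auto-exported copy.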

If you really want modifications to persist across iterations, I suggest that you turn off auto-exporting of X and then modify the explicitly exported copy, so that the program is well defined and portable, although a bit ugly. Here's an example:

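library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
X <- matrix(0, nrow=4, ncol=4)
# Explicitly export X into each worker's global environment
clusterExport(cl, 'X')
# Give each worker its own ID in its global environment
ignore <- clusterApply(cl, seq_along(cl), function(i) ID <<- i)
# .noexport='X' stops foreach from auto-exporting a second copy of X,
# so `<<-` below modifies the explicitly exported copy
results <-
  foreach(i=1:4, .noexport='X') %dopar% {
    X[i,] <<- ID
    X
  }
# Retrieve the final state of X from each worker
finalresults <- clusterEvalQ(cl, X)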

results contains the matrices after each task, and finalresults contains the matrices on each of the workers after the foreach loop has completed.


Update

In general, the body of the foreach loop shouldn't modify any variable that is outside of the foreach loop. I only modify variables that I created previously in the same iteration of the foreach loop. If you want to make a modified version that is only used within that iteration, use a different variable name.
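Applied to the example in the question, that advice might look like the sketch below. The name Xi and the per-iteration seed set.seed(i) are only illustrative assumptions, not part of the original code; the point is that X is read but never reassigned inside the loop, so every iteration resamples the original matrix.

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

X <- matrix(1:400, ncol = 4)

results <- foreach(i = 1:10) %dopar% {
    # Illustrative per-iteration seed (an assumption, for reproducibility only)
    set.seed(i)
    # Store the resample under a new name; X itself is never modified
    Xi <- X[sample.int(nrow(X), replace = TRUE), , drop = FALSE]
    Xi
}

stopCluster(cl)
```

Because nothing outside the loop body is modified, the result does not depend on which worker happens to run which iteration.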