2
votes

I tried to look for a duplicate question, and I know many people have asked about parLapply in R, so I apologize if I missed one that is applicable to my situation.

Problem: I have the following function that runs correctly in R, but when I try to run it in parallel using parLapply (I'm on a Windows machine) I get the error that the $ operator is invalid for atomic vectors. The error says that 3 nodes produced errors no matter how many nodes I set my cluster to; for example, I have 8 cores on my desktop, so I set the cluster to 7 nodes. Here is example code showing where the problem is:

library(parallel)
library(doParallel)
library(arrangements)

#Function

perms <- function(inputs)
{
  x <- 0
  L <- 2^length(inputs$w)           # total number of 0/1 permutations of length m
  ip <- inputs$ip                   # iterator created by arrangements::ipermutations()
  for (i in 1:L)
  {
    y <- ip$getnext() %*% inputs$w  # score the next permutation
    if (inputs$t >= y)
    {
      x <- x + 1                    # count permutations with score <= t
    }
  }
  return(x)
}

# inputs is a list of several other variables that are created before this
# function runs (w, t_obs and iperm); here is a reproducible example of them.
# W is derived from my data; this is just an easy way to make a reproducible example.


set.seed(1)
m <- 15
W <- matrix(runif(15, 0, 1))
iperm <- arrangements::ipermutations(0:1, m, replace = TRUE)
t_obs <- 5

inputs <- list(W, t_obs, iperm)
names(inputs) <- c("w", "t", "ip")

#If I run the function not in parallel
perms(inputs)

#It gives a value of 27322 for this example data

This runs exactly as it should; however, when I try the following to run it in parallel, I get an error:

# make the cluster
cor <- detectCores()
cl <- makeCluster(cor - 1, type = "SOCK")

# passing library and arguments
clusterExport(cl, c("inputs"))
clusterEvalQ(cl, {
  library(arrangements)
})

results <- parLapply(cl, inputs, perms)


I get the error:

Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: $ operator is invalid for atomic vectors

However, I've checked whether anything is an atomic vector using is.atomic(), and is.recursive(inputs) returns TRUE.

My question is: why am I getting this error when I run this with parLapply, when the function otherwise runs correctly, and is there a reason it says "3 nodes produced errors" even when I have 7 nodes?

3
Perhaps a typo, but you never define m used in ipermutations. – r2evans
@r2evans Yes, a typo; m is defined as 15 elsewhere in the code, and I've added that. I don't think I need to pass it in clusterExport, since only iperm depends on it and I pass iperm as part of inputs. – RAND

3 Answers

2
votes

It says "3 nodes" because, as you're passing it to parLapply, you are only activating three nodes. The first argument to parLapply should be a list of things, each element to pass to each node. In your case, your inputs is a list, correct, but it is being broken down, such that your three nodes are effectively seeing:

# node 1
perms(inputs[[1]]) # effectively inputs$w
# node 2
perms(inputs[[2]]) # effectively inputs$t
# node 3
perms(inputs[[3]]) # effectively inputs$ip
# nodes 4-7 idle

You could replicate this on the local host (not parallel) with:

lapply(inputs, perms)

and when you see it like that, perhaps it becomes a little more obvious what is being passed to your nodes. (If you want to see it further, do debug(perms), then run the lapply above, and see what inputs looks like inside that function call.)
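
As a quick local check (my own sketch, not part of the original answer, and assuming inputs and perms from the question are already defined in the session), you can wrap each call in try() to see what each element of inputs does when handed to perms() on its own; since inputs has only three elements, only three of the seven workers ever receive a task.

# Reproduce the per-element failures locally, without the cluster:
# each element of inputs is passed to perms() by itself, so e.g. the matrix W
# (an atomic vector underneath) triggers "$ operator is invalid for atomic vectors".
errs <- lapply(inputs, function(el) try(perms(el), silent = TRUE))
sapply(errs, inherits, "try-error")  # shows which elements fail when passed alone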

To get this to work once on one node (I think not what you're trying to do), you could do

parLapply(cl, list(inputs), perms)

But that's only going to run one instance on one node. Perhaps you would prefer to do something like:

parLapply(cl, replicate(7, inputs, simplify=FALSE), perms)

0
votes

I'm adding an answer in case anyone with a similar problem comes across this. @r2evans answered my original question, which led to the realization that even fixing the above problems would not get me the desired result (see the comments on his answer).

Problem: I am using the package arrangements to generate a large number of combinations and apply a function to them. This becomes very time consuming as the number of combinations gets huge. What we need to do is split the combinations into chunks depending on the number of cores that will run in parallel, and then have each node do the calculations only on its own chunk of the combinations.

Solution:


cor <- detectCores() - 1
cl <- makeCluster(cor, type = "SOCK")

set.seed(1)
m <- 15
W <- matrix(runif(15, 0, 1))
# iperm <- arrangements::ipermutations(0:1, m, replace = TRUE)  # now created on each worker instead
t_obs <- 5

# one chunk index per worker
chunk_list <- as.list(1:cor)

# size of each chunk: the first cor-1 chunks get floor(2^m / cor) permutations,
# and the last chunk gets whatever remains, so all 2^m permutations are covered
chunk_size <- floor((2^m) / cor)
chunk_size <- c(rep(chunk_size, cor - 1), (2^m) - chunk_size * (cor - 1))

# one input list per worker: the same t and w, plus that worker's chunk index and the chunk sizes
inputs_list <- Map(list, t = list(t_obs), w = list(W), chunk_list = chunk_list, chunk_size = list(chunk_size))

#inputs <- list(W,t_obs, iperm)
#names(inputs) <- c("w", "t", "ip", "chunk_it")




perms <- function(inputs)
{
  x <- 0
  # create the iterator on the worker itself (m and cor are exported below)
  ip <- arrangements::ipermutations(0:1, m, replace = TRUE)

  # recompute the chunk sizes on the worker
  chunk_size <- floor((2^m) / cor)
  chunk_size <- c(rep(chunk_size, cor - 1), (2^m) - chunk_size * (cor - 1))

  # skip ahead past the permutations handled by the earlier chunks
  # (note the parentheses: 1:(k - 1), not 1:k - 1)
  if (inputs$chunk_list != 1)
  {
    ip$getnext(sum(chunk_size[1:(inputs$chunk_list - 1)]))
  }

  # process only this worker's chunk
  for (i in 1:chunk_size[inputs$chunk_list])
  {
    y <- ip$getnext() %*% inputs$w
    if (inputs$t >= y)
    {
      x <- x + 1
    }
  }
  return(x)
}




clusterExport(cl, c("inputs_list", "m", "cor"))
clusterEvalQ(cl, {
  library(arrangements)
})

system.time(results <- parLapply(cl, inputs_list, perms))
Reduce(`+`, results)
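
As a follow-up (my own suggestion, not part of the original answer): because the chunks together cover all 2^m permutations, the summed parallel count should match the serial result from the question (27322 for the m = 15 example), and the cluster should be shut down once the results are collected.

# Sanity check (assumes the chunking above is correct): the chunked parallel
# count should equal the serial count from the question; then release the workers.
total <- Reduce(`+`, results)
total == 27322   # expected TRUE for the m = 15 example
stopCluster(cl)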

What I did was split the total number of combinations into chunks, i.e. the first 4681 go to the first node (I have 7 nodes assigned via cor), the next 4681 to the second node, and so on, making sure no combinations are missed. Then I changed my original function to generate the permutations on each node but skip ahead to the combination it should start calculating from, so node 1 starts at the first combination, node 2 starts at the 4682nd, and so on. I'm still working on optimizing this, because it's currently only about 4 times as fast as the non-parallel version even though I'm using 7 cores; I think using the skip option in the permutation function will speed this up, but I haven't checked yet. Hopefully this is helpful to someone else: it cuts my estimated time to run a simulation (with m = 25, not 15) from about 10 days to about 2.5 days.
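
To make the chunk arithmetic concrete (my own illustration using the numbers from this answer): with m = 15 and 7 workers there are 2^15 = 32768 permutations; the first six workers each take floor(32768 / 7) = 4681 of them and the last takes the remaining 4682, so worker 1 starts at permutation 1, worker 2 at 4682, worker 3 at 9363, and so on.

# Chunk boundaries for m = 15 and 7 workers (illustration only)
m   <- 15
cor <- 7
chunk_size  <- floor((2^m) / cor)
chunk_size  <- c(rep(chunk_size, cor - 1), (2^m) - chunk_size * (cor - 1))
chunk_start <- cumsum(c(1, head(chunk_size, -1)))  # first permutation each worker handles
rbind(start = chunk_start, size = chunk_size)
sum(chunk_size) == 2^m   # TRUE: every permutation is covered exactly once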

0
votes

You need to load dplyr on the worker nodes to solve this:

clusterEvalQ(clust, { library(dplyr) })

The above code should solve your issue.