3
votes

I try to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by the two shape parameters alpha & beta. I now draw 100,000 samples of every distribution. As a final result I want to get an order of the distributions with the highest Probability in every sample draw. My first approach was to use lapply for generating a matrix of N * NDRAWS numeric values which was consuming too much memory as N gets beyond 10,000. (10,000 * 100,000 * 8 Bytes)

So I decided to use a sequential approach of ordering every single draw, then cumsum the order of all draws and get the final order as shown in the example below:

set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T), beta=sample(1:200, N, replace=T))  

vec    <- vector(mode = "integer", length = N )

for(i in 1:NDRAWS){
  # order probabilities after a single draw for every theta
  pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )

  # sum up winning positions for every theta
  vec[pos] <- vec[pos] + 1:N
}

# order thetas
ord <- order(-vec)

df[ord,]

This is only consuming N * 4 Bytes of memory, as there is no giant matrix but a single vector of length N. My Question now is, how to speed up this operation using snowfall (or any other multicore package) by taking advantage of my 4 CPU Cores, instead of using just one core???

# parallelize using snowfall pckg
library(snowfall)
sfInit( parallel=TRUE, cpus=4, type="SOCK")
sfLapply( 1:NDRAWS, function(x) ?????? )
sfStop()

Any help is appreciated!

2
Where is the problem? Using the actual package?PascalVKooten
The Question is how to tranfsform the for loop into sfLapply, so that the final ordered data.frame is identical to that one in my example?Christian Borck
This confuses me: vec[pos] <- vec[pos] + 1:N. You try to put a vector into one value?PascalVKooten
I know it's a little confusing, maybe there is a more straight forward approach. What I do is adding position slots of a single ordered draw for every 1:N draws.Christian Borck
Ah, well. The way I now see it is that it is not sequential if you are going to add up the results to one single item in each iteration (which is not shared accross memory)?PascalVKooten

2 Answers

1
votes

This can be parallelized in the same way that one would parallelize random forest or bootstrapping. You just perform the sequential code on each of the workers but with each using a smaller number of iterations. That is much more efficient than splitting each iteration of the for loop into a separate parallel task.

Here's your complete example converted to use the foreach package with the doParallel backend:

set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T),
                 beta=sample(1:200, N, replace=T))
library(doParallel)
nworkers <- detectCores()
cl <- makePSOCKcluster(nworkers)
clusterSetRNGStream(cl, c(1,2,3,4,5,6,7))
registerDoParallel(cl)

vec <- foreach(ndraws=rep(ceiling(NDRAWS/nworkers), nworkers),
               .combine='+') %dopar% {
  v <- integer(N)
  for(i in 1:ndraws) {
    pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )
    v[pos] <- v[pos] + 1:N
  }
  v
}
ord <- order(-vec)
df[ord,]

Note that this gives different results than the sequential version because different random numbers are generated by the workers. I used the parallel random number support provided by the parallel package since that is good practice.

0
votes

Well, the functionality is there. I'm not sure though what you'd be returning with each iteration.

Perhaps try this?

myFunc <- function(xx, N) {
  pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )
  vec[pos] + 1:N
}

Using doParallel will allow you to add results:

require(doParallel)
registerDoParallel(cores=4)
foreach(i=1:NDRAWS, .combine='+') %dopar% myFunc(i, N)