I try to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by the two shape parameters alpha & beta. I now draw 100,000 samples of every distribution. As a final result I want to get an order of the distributions with the highest Probability in every sample draw. My first approach was to use lapply for generating a matrix of N * NDRAWS numeric values which was consuming too much memory as N gets beyond 10,000. (10,000 * 100,000 * 8 Bytes)
So I decided to use a sequential approach of ordering every single draw, then cumsum the order of all draws and get the final order as shown in the example below:
set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T), beta=sample(1:200, N, replace=T))
vec <- vector(mode = "integer", length = N )
for(i in 1:NDRAWS){
# order probabilities after a single draw for every theta
pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )
# sum up winning positions for every theta
vec[pos] <- vec[pos] + 1:N
}
# order thetas
ord <- order(-vec)
df[ord,]
This is only consuming N * 4 Bytes of memory, as there is no giant matrix but a single vector of length N. My Question now is, how to speed up this operation using snowfall (or any other multicore package) by taking advantage of my 4 CPU Cores, instead of using just one core???
# parallelize using snowfall pckg
library(snowfall)
sfInit( parallel=TRUE, cpus=4, type="SOCK")
sfLapply( 1:NDRAWS, function(x) ?????? )
sfStop()
Any help is appreciated!
vec[pos] <- vec[pos] + 1:N
. You try to put a vector into one value? – PascalVKooten