4
votes

I want to extract representative samples from several populations (a, b, c, d, ...; see below) using the "clhs" package in R. The sampling takes a very long time on my multicore computer, so I'd like to run the sampling procedures in parallel, using multiple CPU cores simultaneously.

These are some of my (example) data frames ("populations") from which I want to draw the samples:

a <- as.data.frame(replicate(1000, rnorm(20)))
b <- as.data.frame(replicate(1000, rnorm(20)))
c <- as.data.frame(replicate(1000, rnorm(20)))
d <- as.data.frame(replicate(1000, rnorm(20)))

The clhs code I want to run is:

clh_a <- clhs(x=a, size=round(nrow(a)/5), iter=2000, simple=F) # 20% of all samples should be selected
clh_b <- clhs(x=b, size=round(nrow(b)/5), iter=2000, simple=F)

etc...

How can I run this sampling process in parallel? Or is there another, more efficient way of doing this?

Addendum (many thanks to "zipfzapf"):

I tried to use "parLapply". Unfortunately, R throws the error message "Error in length(x): 'x' is missing", which I honestly don't understand. Any ideas?

My code:

    library("snow")
    a <- as.data.frame(replicate(1000, rnorm(20)))
    b <- as.data.frame(replicate(1000, rnorm(20)))
    c <- as.data.frame(replicate(1000, rnorm(20)))
    d <- as.data.frame(replicate(1000, rnorm(20)))
    abcd <- list(a, b, c, d)
    cl <- makeCluster(4)
    results <- parLapply(cl,
       X = abcd,
       FUN = function(i) {
         clhs(x = i, size = round(nrow(i) / 5), iter = 2000, simple = FALSE)
       },
    )

3 Answers

3
votes

This works for me (notice I changed the number of iterations to make things move along at a reasonable pace).

library(snowfall)
sfInit(parallel = TRUE, cpus = 4, type = "SOCK")
sfLibrary(clhs)

x <- sfLapply(abcd, fun = function(x) {
  clhs(x = x, size = round(nrow(x) / 5), iter = 200, simple = FALSE)
})

     Length Class       Mode
[1,] 5      cLHS_result list
[2,] 5      cLHS_result list
[3,] 5      cLHS_result list
[4,] 5      cLHS_result list
1
votes

The function mclapply from the (builtin) package parallel is a multi-core version of lapply:

library(parallel)

# population samples
abcd <- list(a, b, c, d)

# multi-core version of 'lapply(abcd, [....])'
results <- parallel::mclapply(
  X = abcd,
  FUN = function(elem) {
    clhs(x = elem, size = round(nrow(elem) / 5), iter = 2000, simple = FALSE)
  },
  mc.preschedule = FALSE,
  mc.cores = 4L
)

This will give you a list where each element contains the result of the corresponding clhs call.

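Note that the arguments mc.preschedule and mc.cores are optional. Setting mc.preschedule to FALSE is a good idea when each call of FUN may take a while (as in your case): one job is then forked per list element, instead of the elements being divided among the cores in advance.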
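One caveat: mclapply works by forking the R process, which is not available on Windows, so mc.cores has to be 1 there. Below is a minimal sketch of a wrapper that handles this. The helper name parallel_sample and its arguments are made up for illustration, the example populations are smaller than in the question to keep it quick, and the clhs step only runs if the package is installed.

```r
library(parallel)

# Hypothetical helper: apply `sampler` to every population, forking where possible.
parallel_sample <- function(populations, sampler, cores = 4L) {
  # mclapply forks the R process; Windows cannot fork, so use one core there.
  if (.Platform$OS.type == "windows") cores <- 1L
  mclapply(populations, sampler, mc.preschedule = FALSE, mc.cores = cores)
}

# Small example populations (20 rows each, as in the question, but fewer columns).
abcd <- replicate(4, as.data.frame(replicate(10, rnorm(20))), simplify = FALSE)

# Only run the cLHS step if the package is available.
if (requireNamespace("clhs", quietly = TRUE)) {
  results <- parallel_sample(abcd, function(elem) {
    clhs::clhs(x = elem, size = round(nrow(elem) / 5), iter = 200, simple = FALSE)
  })
}
```

Because the sampler is passed in as an argument, the same wrapper can be tried first with a cheap stand-in function before running the full 2000 iterations.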

0
votes

A solution using "snow" - simply adding "clusterEvalQ(cl, library(clhs))" did the trick:

a <- as.data.frame(replicate(1000, rnorm(20)))
b <- as.data.frame(replicate(1000, rnorm(20)))
c <- as.data.frame(replicate(1000, rnorm(20)))
d <- as.data.frame(replicate(1000, rnorm(20)))
abcd <- list(a, b, c, d)
library("snow")
cl <- makeCluster(4)
clusterEvalQ(cl, library(clhs))
results <- parLapply(cl, abcd, fun = function(elem) {
    clhs(x = elem, size = round(nrow(elem) / 2), iter = 50)
  })
stopCluster(cl)
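The reason clusterEvalQ matters here is that each worker in a socket cluster is a fresh R session that has not loaded clhs. Here is a sketch of the same pattern using the base "parallel" package instead of "snow" (that swap is my assumption, not part of the answer above). The wrapper name cluster_sample and its arguments are hypothetical; it stops the cluster even if a worker fails, and the clhs step only runs if the package is installed.

```r
library(parallel)

# Hypothetical wrapper: run `sampler` over `populations` on a socket cluster.
cluster_sample <- function(populations, sampler, packages = character(), workers = 2L) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)  # release the workers even on error

  # Each worker is a fresh R session, so attach the needed packages there.
  for (p in packages) {
    clusterCall(cl, library, p, character.only = TRUE)
  }

  # The list and the function are passed positionally, not by argument name.
  parLapply(cl, populations, sampler)
}

# Small example populations (20 rows each, fewer columns than in the question).
abcd <- replicate(4, as.data.frame(replicate(10, rnorm(20))), simplify = FALSE)

# Only run the cLHS step if the package is available.
if (requireNamespace("clhs", quietly = TRUE)) {
  results <- cluster_sample(abcd, function(elem) {
    clhs(x = elem, size = round(nrow(elem) / 5), iter = 200, simple = FALSE)
  }, packages = "clhs")
}
```

Passing the list and the function positionally avoids relying on how a particular package spells the argument names of parLapply.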

Many thanks again to zipfzapf & Roman Luštrik!