
I'm using foreach to parallelise a simple loop in some R code. Everything works fine, and I'm getting an acceptable speed-up - except that the output from the foreach loop is 'missing' some results, because (it seems) they are being duplicated. In other words, I'm presuming that the same piece of work is being sent to EACH worker before the loop increments, rather than pushing it to each worker as they become free.

I'm using doSNOW as the parallel backend (R version 2.15.3, foreach version 1.4.1, doSNOW version 1.0.9). The code is essentially as follows:

library(foreach)
library(doSNOW)
the.cores <- 2
cl <- makeCluster(rep("localhost", the.cores), type="SOCK")
registerDoSNOW(cl)

getRows <- function(fileToRead, numberOfRows, rowsToSkip){
  return( read.csv(fileToRead, nrows=numberOfRows, skip=rowsToSkip, stringsAsFactors=FALSE) )
}

doCalculation <- function(x){
  # do some stuff with x
  return(result)
}

calcTest <- function(fileToRead, numberOfRows, rowsToSkip){
  theData <- getRows(fileToRead, numberOfRows, rowsToSkip)
  calcs <- doCalculation(theData)
  return(calcs)
}


final.results <- foreach(i=1:n) %dopar% {
  theResult <- lapply(aFile, calcTest, i=i, nrows=numberOfRows, rowstoskip=rowsToSkip)
}

The issue shows up in the results. I have 2 physical and 4 logical cores on my machine, and the outcome follows a similar pattern in both cases. With n and the number of cores set as follows, the results are:

n = 6
the.cores <- 2
unlist(final.results)
1 1 2 2 3 3 

Similarly, for

n = 6
the.cores <- 4

I get

unlist(final.results)
1 1 1 1 2 2

The correct result, calculated in serial and checked manually, is:

unlist(final.results)
[1] 1 2 3 4 5

Everything else works fine; I'm just a little confused. I assumed that tasks would be pushed to each worker as it became free, so the parallel results should replicate the serial results exactly. I also assumed that in a very simple example like this (it's only intended to speed up some moderately-sized calculations a bit!) it wouldn't be necessary to break the foreach loop into explicit blocks, one per worker: am I right in thinking that? The function in the lapply call itself calls other functions, the first of which reads a chunk of numeric values from a file before the others perform calculations on that chunk: could this be where the issue is?
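
As a sanity check, one option would be to have each iteration return its loop index together with the worker's process ID, to see how the tasks are actually being scheduled. A minimal sketch, assuming Sys.getpid() is a reasonable way to identify a worker:

library(foreach)
library(doSNOW)

cl <- makeCluster(rep("localhost", 2), type="SOCK")
registerDoSNOW(cl)

# Each task returns its index and the PID of the worker that ran it,
# so duplicated indices (as opposed to duplicated results) are easy to spot.
scheduling <- foreach(i=1:6) %dopar% {
  c(index=i, pid=Sys.getpid())
}
stopCluster(cl)

do.call(rbind, scheduling)

If every index appears exactly once there, the scheduling itself isn't duplicating work, and the problem presumably lies in how i is used inside the loop body.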

Finally, if I set

the.cores <- 1

to replicate the serial computation, the results are exactly correct - i.e.

unlist(final.results)
[1] 1 2 3 4 5

Any explanations to remedy my ignorance very much appreciated! :-)

EDIT: just to note, using the same cluster setup in the following test example, everything works just fine.

library(foreach)
library(doSNOW)
the.cores <- 2
cl <- makeCluster(rep("localhost", the.cores), type="SOCK")
registerDoSNOW(cl)

my.fun <- function(x) { x^2 }
the.output <- foreach(i=1:10) %dopar% {
  my.fun(i)
}

gives the expected result:

[1]   1   4   9  16  25  36  49  64  81 100

You need to make your code reproducible to facilitate debugging. – Roland

1 Answer


I'm probably confused about what you want to do, but I'm guessing that you want to process a file in parallel with each worker reading its own chunk of the file. To do that, I would use two iteration variables in the foreach loop. Here's an example using a dummy calcTest function that simply returns the two key input arguments to demonstrate the technique:

library(doSNOW)
library(iterators)

the.cores <- 4
cl <- makeSOCKcluster(the.cores)
registerDoSNOW(cl)

totalRows <- 1000

# Split totalRows into one chunk size per worker, then work out how many
# rows each task has to skip to reach the start of its chunk.
nrows <- unlist(as.list(idiv(totalRows, chunks=the.cores)))
skip <- cumsum(c(0, nrows))[1:the.cores]

# Dummy calcTest: just return the two key arguments it was given.
calcTest <- function(fileToRead, numberOfRows, rowsToSkip) {
  c(numberOfRows, rowsToSkip)
}

aFile <- 'file.dat'

# Iterate over nrows and skip together so each task gets its own chunk.
final.results <- foreach(numberOfRows=nrows, rowsToSkip=skip) %dopar% {
  calcTest(aFile, numberOfRows, rowsToSkip)
}

When this is executed, final.results becomes:

> final.results
[[1]]
[1] 250   0

[[2]]
[1] 250 250

[[3]]
[1] 250 500

[[4]]
[1] 250 750

So the first worker processes lines 1-250, the second worker processes 251-500, etc. Is that basically what you want to do?
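
If that is the goal, your real getRows/doCalculation functions should slot into the same pattern. Here's a rough, untested sketch along those lines; the header=FALSE and colSums choices are assumptions standing in for your actual file layout and calculation:

library(doSNOW)
library(iterators)

the.cores <- 4
cl <- makeSOCKcluster(the.cores)
registerDoSNOW(cl)

totalRows <- 1000
nrows <- unlist(as.list(idiv(totalRows, chunks=the.cores)))
skip <- cumsum(c(0, nrows))[1:the.cores]

# Hypothetical versions of your functions: read one chunk, then process it.
getRows <- function(fileToRead, numberOfRows, rowsToSkip) {
  read.csv(fileToRead, nrows=numberOfRows, skip=rowsToSkip,
           header=FALSE, stringsAsFactors=FALSE)
}
doCalculation <- function(x) {
  colSums(x)   # placeholder for your real calculation
}
calcTest <- function(fileToRead, numberOfRows, rowsToSkip) {
  doCalculation(getRows(fileToRead, numberOfRows, rowsToSkip))
}

aFile <- 'file.dat'
final.results <- foreach(numberOfRows=nrows, rowsToSkip=skip) %dopar% {
  calcTest(aFile, numberOfRows, rowsToSkip)
}
stopCluster(cl)

Two things to watch: if the file has a header line, the skip/header accounting needs adjusting (the sketch assumes a headerless file), and each worker still scans past the rows it skips, so this mainly parallelises the calculation rather than the reading.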