I have written the following code (running in RStudio for Windows) to read a long list of very large text files into memory using a parallel foreach loop:
library(doParallel)  # also loads foreach and parallel

open.raw.txt <- function() {
  # Let the user pick the .txt files to import (Windows file dialog)
  files <- choose.files(caption = "Select .txt files for import")
  # Leave two cores free for the OS and RStudio
  cores <- detectCores() - 2
  registerDoParallel(cores)
  # Read the 4th column of each file and bind the results column-wise
  data <- foreach(file.temp = files, .combine = cbind) %dopar%
    as.numeric(read.table(file.temp)[, 4])
  stopImplicitCluster()
  return(data)
}
Unfortunately, the function fails to complete, and debugging shows that it gets stuck at the foreach loop. Oddly, Windows Task Manager shows processor usage close to full capacity (I have 32 cores, and this should use 30 of them) for around 10 seconds, after which it drops back to baseline. However, the loop never completes, which suggests it is doing the work and then getting stuck.
Even more bizarrely, if I remove the function wrapper and just run each step one by one as follows:
files <- choose.files(caption = "Select .txt files for import")
cores <- detectCores() - 2
registerDoParallel(cores)
data <- foreach(file.temp = files, .combine = cbind) %dopar%
  as.numeric(read.table(file.temp)[, 4])
stopImplicitCluster()
Then it all works fine. What is going on?
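In case it helps to narrow this down, here is an untested sketch of a variant that separates the parallel reads from the combining step and times each phase. The name open.raw.txt.timed is just for illustration, and it assumes every file has a numeric fourth column of the same length, so that binding the list afterwards is equivalent to .combine = cbind:

library(doParallel)  # also loads foreach and parallel

open.raw.txt.timed <- function() {
  files <- choose.files(caption = "Select .txt files for import")
  registerDoParallel(detectCores() - 2)

  # Phase 1: parallel reads, collected as a list (foreach's default .combine)
  read.time <- system.time(
    cols <- foreach(file.temp = files) %dopar%
      as.numeric(read.table(file.temp)[, 4])
  )

  # Phase 2: column-bind on the master process only
  bind.time <- system.time(data <- do.call(cbind, cols))

  stopImplicitCluster()
  print(read.time)  # time spent in the parallel reads
  print(bind.time)  # time spent combining results on the master
  return(data)
}

If the second timing dominates when this is run as a function, that would point at collecting and combining the results rather than at the reads themselves.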
Update: I ran the function, left it for a while (around an hour), and it finally completed. I am not quite sure how to interpret this, given that multiple cores are still only busy for the first 10 seconds or so. Could the issue be with how the tasks are distributed across the workers? Or with memory management? I'm new to parallelism, so I'm not sure how to investigate this.
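One other avenue I'm considering (again just a sketch I haven't run, with the name open.raw.txt.explicit made up for illustration) is to create the cluster explicitly with makeCluster() and shut it down with stopCluster(), instead of relying on registerDoParallel(cores) and stopImplicitCluster(), so that each stage of the parallel set-up and teardown is a separate, visible call:

library(doParallel)  # also loads foreach and parallel

open.raw.txt.explicit <- function() {
  files <- choose.files(caption = "Select .txt files for import")

  cl <- makeCluster(detectCores() - 2)  # explicit PSOCK cluster (the default on Windows)
  registerDoParallel(cl)

  data <- foreach(file.temp = files, .combine = cbind) %dopar%
    as.numeric(read.table(file.temp)[, 4])

  stopCluster(cl)  # explicit shutdown instead of stopImplicitCluster()
  return(data)
}

Would comparing this against the original version tell me anything useful about whether the implicit cluster handling is involved?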
Comment: Have you tried removing the call to stopImplicitCluster from the function to see if that's where it's hanging? – Steve Weston