0 votes

I have written the following code (running in RStudio for Windows) to read a long list of very large text files into memory using a parallel foreach loop:

library(doParallel)  # also attaches foreach and parallel

open.raw.txt <- function() {
  files <- choose.files(caption = "Select .txt files for import")
  cores <- detectCores() - 2
  registerDoParallel(cores)
  # Read column 4 of each file on a worker and bind the results column-wise
  data <- foreach(file.temp = files, .combine = cbind) %dopar%
    as.numeric(read.table(file.temp)[, 4])
  stopImplicitCluster()
  return(data)
}

Unfortunately, the function fails to complete, and debugging shows that it gets stuck at the foreach loop. Oddly, Windows Task Manager shows the processor at close to full capacity (I have 32 cores, and this should use 30 of them) for around 10 seconds, after which usage drops back to baseline. The loop itself never returns, which suggests the work is being done and then something hangs.
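One thing I have not yet tried is looking at the workers' output directly. My understanding from ?makeCluster is that creating the cluster explicitly with outfile = "" routes worker stdout/stderr back to the master, so a sketch like the following (the message() call is just for illustration) might show where things stall:

library(doParallel)

cl <- makeCluster(detectCores() - 2, outfile = "")  # outfile = "" sends worker output to the master
registerDoParallel(cl)
data <- foreach(file.temp = files, .combine = cbind) %dopar% {
  message("reading ", file.temp)  # per-worker progress, visible because of outfile = ""
  as.numeric(read.table(file.temp)[, 4])
}
stopCluster(cl)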

Even more bizarrely, if I remove the function wrapper and just run each step one by one as follows:

files <- choose.files(caption = "Select .txt files for import")
cores <- detectCores() - 2
registerDoParallel(cores)
data <- foreach(file.temp = files, .combine = cbind) %dopar%
  as.numeric(read.table(file.temp)[, 4])
stopImplicitCluster()

Then it all works fine. What is going on?

Update: I ran the function, left it for a while (around an hour), and it eventually completed. I am not sure how to interpret this, given that multiple cores are only busy for the first 10 seconds or so. Could the issue be with how the tasks are shared out, or with memory management? I'm new to parallelism, so I'm not sure how to investigate this.
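One guess I would like to test is that the parallel reads finish quickly (hence the ~10 seconds of full CPU) and the rest of the time goes into the single-threaded cbind combine. A sketch that separates the two stages by collecting the columns as a list and doing one combine at the end:

library(doParallel)

registerDoParallel(detectCores() - 2)
# Stage 1: parallel reads, no combining inside the loop (foreach returns a list by default)
cols <- foreach(file.temp = files) %dopar%
  as.numeric(read.table(file.temp)[, 4])
stopImplicitCluster()
# Stage 2: a single combine, timed separately from the reads
system.time(data <- do.call(cbind, cols))

If stage 2 dominates, the bottleneck is the combine rather than the workers.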

How are you calling the function? – Alex W
data <- open.raw.txt() – D Greenwood
Could you remove the call to stopImplicitCluster from the function to see if that's where it's hanging? – Steve Weston
Thanks Steve. I tried that, and also added a one-second sleep between the two (in case it was just getting carried away with itself), but no luck. I also thought the issue might be with writing to 'data', perhaps because of permissions. I don't know much about how permissions work on Windows, but I tried running R/RStudio as an administrator, and it hasn't made a difference. – D Greenwood
Just posted an update. – D Greenwood

1 Answer

-1 votes

The problem is that you have multiple processes opening and closing the same file. Usually, when a file is opened by one process it is locked to other processes, which prevents reading the file in parallel.
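If that is what is happening, reading the files one at a time would avoid the contention entirely. A minimal serial sketch:

# Serial fallback: each file is opened and closed before the next one is touched
cols <- lapply(files, function(f) as.numeric(read.table(f)[, 4]))
data <- do.call(cbind, cols)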