I'm having difficulties getting a parallel scenario [doSNOW] to work that involves shared memory [bigmemory]. In summary, some of the foreach workers fail with the error "Error in { : task 1 failed - "cannot open the connection"". More specifically, the cluster output log shows "'/temp/x_bigmatrix.desc': Permission denied", as if there were a problem with concurrent access to the big.matrix descriptor file.
Please excuse me for not including a reproducible example (the code is rather complex); instead I'll try to explain the workflow and its main points.
I have a matrix X (x_matrix in the code), which is converted into a big.matrix through:
x_bigmatrix <- as.big.matrix(x_matrix,
                             type = "double",
                             separated = FALSE,
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = "./temp/")
Then, I initialize the SOCK cluster with doSNOW [I'm on Windows 10 x64]:
cl <- parallel::makeCluster(N - 1, outfile = "output.log")
registerDoSNOW(cl)
(showConnections() properly shows the registered connections.)
Now I should explain the structure: there is a main loop (foreach) over the workers and, inside it, an inner loop where each worker iterates over the rows of X. The idea is that each worker is fed chunks of data sequentially through the inner loop; a worker may keep some of these observations, but instead of storing the observations themselves it stores their row indexes for later retrieval. To complicate things further, each worker modifies an associated R6 class environment where the indices are stored. I mention this because access to the big.matrix descriptor file takes place in two different places: in the main foreach loop and within each R6 environment. The body of the foreach loop is the following:
workersOutput <- foreach(worker = workers, .verbose = TRUE) %dopar% {
  # I don't use the .packages argument here because I source a file
  # that loads all the needed libraries at the beginning of the loop
  source(libs)
  # Attach the big.matrix using attach.big.matrix and the descriptor file
  x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  # The inner loop iterates over the rows of X.
  # Each worker may store some of these row indices for later retrieval.
  for (i in seq_len(nrow(x_bigmatrix))) {
    # The R6 object associated with this worker is modified, storing indices...
    # Within these environments there is read-only access through a getter
    # that uses the same procedure as above:
    # x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  }
}
stopCluster(cl)
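For completeness, the R6 classes look roughly like this (a minimal sketch; the class and method names are made up for illustration, but the getter shows the second attach.big.matrix() call I mentioned):

library(R6)
library(bigmemory)

WorkerStore <- R6Class("WorkerStore",
  public = list(
    indices = integer(0),
    # Store a row index instead of the observation itself
    add_index = function(i) {
      self$indices <- c(self$indices, i)
      invisible(self)
    },
    # Read-only getter: re-attaches the big.matrix from the descriptor
    # file, which is where the "Permission denied" error seems to occur
    get_rows = function() {
      x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
      x_bigmatrix[self$indices, , drop = FALSE]
    }
  )
)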
The problem occurs in the inner loop, when trying to access the file-backed big.matrix. If I change the behaviour of these environments to store the observations themselves instead of the row indices (so there is no access to the descriptor file within these objects anymore), everything works without problems. Likewise, if I run it without parallelization [registerDoSEQ()] but storing the row indices in the objects, there are no errors either. So the problem only appears when I combine parallelization with a second attachment of the shared big.matrix inside the R6 environments. The strange thing is that some workers run longer than others, and at least one even finishes its run, which makes me suspect a problem with concurrent access to the big.matrix descriptor file.
Am I failing at some basics here?
Comment: Have you tried to pass the descriptor object (obtained with describe()) rather than the descriptor file which contains the object? – F. Privé
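As I understand that suggestion, it would look roughly like this (a minimal sketch, assuming the same objects as above; describe() returns a descriptor object that attach.big.matrix() also accepts):

library(bigmemory)

# Create the descriptor object once on the master...
x_descriptor <- describe(x_bigmatrix)

# ...and let each worker attach via the exported object rather than
# re-reading the .desc file from disk:
workersOutput <- foreach(worker = workers, .packages = "bigmemory") %dopar% {
  x_bigmatrix <- attach.big.matrix(x_descriptor)
  # ... rest of the worker body as before ...
}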