I'm having difficulties getting a parallel scenario [doSNOW] to work that involves shared memory [bigmemory]. In summary, some of the foreach workers fail with the error "Error in { : task 1 failed - "cannot open the connection"". More specifically, the cluster output log shows "'/temp/x_bigmatrix.desc': Permission denied", as if there were a problem with concurrent access to the big.matrix descriptor file.
Please excuse me for not including a reproducible example (the code is rather complex); instead I'll try to explain the workflow and its main points.
I have a matrix X (x_matrix in the code), which is converted into a big.matrix through:
x_bigmatrix <- as.big.matrix(x_matrix,
                             type = "double",
                             separated = FALSE,
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = "./temp/")
Then, I initialize the SOCK cluster with doSNOW [I'm on Windows 10 x64]:
cl <- parallel::makeCluster(N - 1, outfile = "output.log")
registerDoSNOW(cl)
(showConnections() properly shows the registered connections.)
Now I should explain the structure: there is a main loop (foreach) over the workers and, inside it, an inner loop where each worker iterates over the rows of X. The idea is that each worker is fed chunks of data sequentially through the inner loop; a worker may keep some of these observations, but instead of storing the observations themselves it stores their row indexes for later retrieval. To complicate things further, each worker modifies an associated R6 class environment where the indices are stored. I mention this because access to the big.matrix descriptor file takes place in two different places: in the main foreach loop and within each R6 environment. The body of the foreach loop is the following:
workersOutput <- foreach(worker = workers, .verbose = TRUE) %dopar% {
  # I don't use the .packages argument here because I source a file
  # that loads all the needed libraries at the beginning of the loop
  source(libs)
  # Attach the big.matrix using attach.big.matrix and the descriptor file
  x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  # The inner loop iterates over the rows of X.
  # Each worker may store some of these row indices for later retrieval.
  for (i in seq_len(nrow(x_bigmatrix))) {
    # The R6 object associated with this worker is modified, storing indices...
    # Within these environments there is read-only access through a getter
    # that uses the same procedure as above:
    # x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  }
}
stopCluster(cl)
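For completeness, the R6 classes look roughly like this (a minimal sketch; the class and method names are made up for illustration, but the getter shows the second attach.big.matrix() call I mentioned):

library(R6)
library(bigmemory)

WorkerStore <- R6Class("WorkerStore",
  public = list(
    indices = integer(0),
    # Store a row index instead of the observation itself
    add_index = function(i) {
      self$indices <- c(self$indices, i)
      invisible(self)
    },
    # Read-only getter: re-attaches the big.matrix from the descriptor
    # file, which is where the "Permission denied" error seems to occur
    get_rows = function() {
      x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
      x_bigmatrix[self$indices, , drop = FALSE]
    }
  )
)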
The problem occurs in the inner loop, when trying to access the file-backed big.matrix. If I change the behaviour of these environments to store the observations themselves instead of the row indices (so there is no access to the descriptor file within these objects anymore), everything works without problems. Likewise, if I run it without parallelization [registerDoSEQ()] but storing the row indices in the objects, there are no errors either. So the problem only appears when I combine parallelization with a second attachment of the shared big.matrix inside the R6 environments. The strange thing is that some workers run longer than others, and at least one even finishes its run, which makes me suspect a problem with concurrent access to the big.matrix descriptor file.
Am I failing at some basics here?
Comment: Have you tried to pass the descriptor object (obtained with describe()) rather than the descriptor file which contains the object? – F. Privé
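As I understand that suggestion, it would look roughly like this (a minimal sketch, assuming the same objects as above; describe() returns a descriptor object that attach.big.matrix() also accepts):

library(bigmemory)

# Create the descriptor object once on the master...
x_descriptor <- describe(x_bigmatrix)

# ...and let each worker attach via the exported object rather than
# re-reading the .desc file from disk:
workersOutput <- foreach(worker = workers, .packages = "bigmemory") %dopar% {
  x_bigmatrix <- attach.big.matrix(x_descriptor)
  # ... rest of the worker body as before ...
}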