
I am trying to parallelize a function across the 4 cores of my machine using parLapply. My function contains two nested loops that are meant to fill in some empty columns of a predefined matrix M. However, when I run the code below I get the following error:

2 nodes produced errors; first error: incorrect number of dimensions 

Code:

require("parallel")
TheData<-list(E,T)        # list of 2 matrices of different dimensions, T is longer and wider than E

myfunction <- function(TheData) {
for (k in 1:length(TheData[[1]][,1])) {
    distance<-matrix(,nrow=length(TheData[[1]][,1]),ncol=1)
     for (j in 1:length(TheData[[2]][,1])) {
    distance[j]<-sqrt((as.numeric(TheData[[2]][j,1])-as.numeric(TheData[[1]][k,2]))^2+(as.numeric(TheData[[2]][j,2])-as.numeric(TheData[[1]][k,1]))^2)              
    }         
    index<-which(distance == min(distance))
    M[k,4:9]<-c(as.numeric(TheData[[2]][index,1]),as.numeric(TheData[[2]][index,2]),as.numeric(TheData[[2]][index,3]),as.numeric(TheData[[2]][index,4]),as.numeric(TheData[[2]][index,5]),as.numeric(TheData[[2]][index,6]))   
rm(distance)
gc() 
}  
}
n_cores <- 4
Cl = makeCluster(n_cores)
Results <- parLapplyLB(Cl, TheData, myfunction)
# I also tried: Results <- parLapply(Cl, TheData, myfunction)

1 Answer


In your example, parLapply iterates over a list of matrices and passes each matrix, one at a time, as the argument to "myfunction". However, "myfunction" expects its argument to be a list of two matrices, so an error occurs. I can reproduce that error with:

> E <- matrix(0, 4, 4)
> E[[1]][,1]
Error in E[[1]][, 1] : incorrect number of dimensions
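
To see what the function actually receives, here is a minimal sketch using serial lapply, which follows the same one-call-per-element pattern as parLapply. The matrices are made-up stand-ins for "E" and "T", and the name "received" is only for illustration:

```r
# Made-up matrices standing in for E and T
E <- matrix(0, 4, 4)
T <- matrix(0, 6, 5)

# lapply calls the function once per list element, so each call
# sees a single matrix, not the list of two matrices
received <- lapply(list(E, T), function(x) {
  list(is_matrix = is.matrix(x), dims = dim(x))
})
str(received)
```

Each call gets one matrix, so indexing it with [[1]][,1] inside the function fails.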

I'm not sure what you're really trying to do, but with the current implementation of "myfunction", I would expect you to call parLapply with a list of lists containing two matrices, such as:

TheDataList <- list(list(A,B), list(C,D), list(E,F), list(G,H))

Passing this as the second argument to parLapply would result in "myfunction" being called four times, each time with a list containing two matrices.
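
A minimal sketch of that calling pattern, with made-up matrices and a hypothetical helper "pair_rows" standing in for "myfunction" (neither comes from your code):

```r
library(parallel)

# Hypothetical stand-in for "myfunction": takes ONE list of two
# matrices and returns their row counts
pair_rows <- function(pair) {
  c(first = nrow(pair[[1]]), second = nrow(pair[[2]]))
}

# Made-up data: a list of lists, each holding two matrices
pairs <- list(
  list(matrix(0, 2, 3), matrix(0, 5, 3)),
  list(matrix(0, 4, 3), matrix(0, 6, 3))
)

cl <- makeCluster(2)
res <- parLapply(cl, pairs, pair_rows)  # one call per inner list
stopCluster(cl)

str(res)
```

Here the helper is called twice, once for each inner list.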

But your example has another problem. It looks like you expect parLapply to modify the matrix "M" as a side-effect, but it can't: each worker is a separate R process, so nothing assigned there reaches the "M" in your master session. I think you should change "myfunction" to return a matrix. parLapply will return the matrices in a list which you can then bind together into the desired result.
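
As a rough sketch of the return-and-bind approach, assuming each task produces one row of the result (the helper "make_row" and its contents are made up for illustration):

```r
library(parallel)

# Hypothetical task function: returns a one-row matrix instead of
# trying to write into a matrix that lives on the master
make_row <- function(k) {
  matrix(c(k, k^2), nrow = 1)
}

cl <- makeCluster(2)
rows <- parLapply(cl, 1:4, make_row)  # list of four 1 x 2 matrices
stopCluster(cl)

# Bind the returned pieces into a single matrix on the master
M2 <- do.call(rbind, rows)
print(M2)
```

parLapply returns results in the same order as its input, so the rows of the bound matrix line up with the task indices.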


Update

From your comment, I now believe that you essentially want to parallelize "myfunction". Here's my attempt to do that:

library(parallel)
cl <- makeCluster(4)

myfunction <- function(Exy) {
  iM <- integer(nrow(Exy))
  for (k in seq_len(nrow(Exy))) {  # seq_len is safe if a block has zero rows
    distance <- sqrt((Txy[,1] - Exy[k,2])^2 + (Txy[,2] - Exy[k,1])^2)
    iM[k] <- which.min(distance)
  }
  iM
}

# Random example data for testing
T <- matrix(rnorm(150), 10)
E <- matrix(rnorm(120), 10)

# Only export the first two columns of T to the workers
Txy <- T[,1:2]
clusterExport(cl, c('Txy'))

# Parallelize "myfunction" by calling it in parallel on block rows of "E".
ExyList <- parallel:::splitRows(E[,1:2], length(cl))
iM <- do.call('c', clusterApply(cl, ExyList, myfunction))

# Update "M" using data from "T" indexed by "iM"
M <- matrix(0, nrow(E), 9)  # more fake data: one row per row of "E"
for (k in seq_along(iM)) {
  # Row k of "M" gets the row of "T" nearest to row k of "E"
  M[k,4:9] <- T[iM[k], 1:6]
}
print(M)

stopCluster(cl)

Notes:

  • I vectorized myfunction which should make it more efficient. Hopefully it's nearly correct.
  • I also modified myfunction to return a vector of indices into "T" to reduce the amount of data sent back to the master.
  • The splitRows function from the parallel package is used to split the first two columns of "E" into a list of submatrices.
  • splitRows isn't exported by parallel, so I used ':::'. If this offends you, then use the splitRows function from snow which is exported.
  • The first two columns of "T" are exported to each of the workers since each task requires the entire first two columns.
  • clusterApply is used rather than parLapply since we need to iterate over submatrices of E.
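
If you would rather avoid ':::' without depending on snow, one possible alternative is to build the row blocks from splitIndices, which parallel does export. This is only a sketch with made-up data, and "n_chunks" is an assumed stand-in for length(cl):

```r
library(parallel)

# Made-up data with the same shape as the test data above
set.seed(1)
E <- matrix(rnorm(120), 10)
Exy <- E[, 1:2]

n_chunks <- 4  # assumption: one chunk per worker, i.e. length(cl)

# splitIndices returns a list of row-index vectors covering 1..nrow(Exy)
idx <- splitIndices(nrow(Exy), n_chunks)

# Subset rows by each index vector; drop = FALSE keeps one-row blocks as matrices
ExyList <- lapply(idx, function(i) Exy[i, , drop = FALSE])

sapply(ExyList, nrow)
```

The resulting list can be passed to clusterApply in the same way as the output of splitRows.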