Improve speed/use of gDistance function by using parallel processing and/or plyr/dplyr?

Question

I have a bunch of locations for each of about 1000 individuals. The total dataset used to be around 2.5 million and my processing script took about 20 hours to run. Now however, I have 24 million observations and figure I need to clean up my code and perhaps use parallel processing if I can.

For each point, I want to find the closest polygon (most points are not in a polygon) and the distance of that polygon. The points are mostly observations at sea and the polygons are the coastal (U.S.) counties nearest the points.

I have been doing this using the gDistance function in the rgeos package and have been running a series of loops (I know, I know) in order to break up the processing by each of my individuals. I have spent a lot of time trying to figure out how to move this into the plyr/dplyr syntax somehow but can't quite get it. Part of my issue, I assume has to do with my object classes being SpatialPoint and SpatialPoylgonDataFrames.

library(sp)
library(rgeos)
library(plyr)
#  Create SpatialPointsDataFrame
#  My actual dataset has 24 million observations
my.pts <- data.frame(LONGITUDE=c(-85.4,-84.7,-82.7,-82.7,-86.5,-88.9,-94.8,-83.9,-87.8,-82.8),
             LATITUDE=c(30.0,29.9,27.5,28.5,30.4,26.1,29.3,28.0,29.4,27.8),
             MYID=c(1,1,2,2,2,2,3,4,4,4),
             INDEX=1:10)
coordinates(my.pts) <- c("LONGITUDE","LATITUDE")

#  Create two polygons in a SpatialPolygonsDataFrame
#  My actual dataset has 71 polygons (U.S. counties)
x1 <- data.frame(x=c(-92.3, -92.3, -90.7, -90.7, -92.3, -92.3),y=c(27.6, 29.4, 29.4, 27.6, 27.6, 27.6))
x1 <- as.data.frame(x1) 
x1 <- Polygon(rbind(x1,x1[1,]))

x2 <- data.frame(x=c(-85.2, -85.2, -83.3, -83.2, -85.2, -85.2),y=c(26.4, 26.9, 26.9, 26.0, 26.4,     26.4))
x2 <- as.data.frame(x2) 
x2 <- Polygon(rbind(x2,x2[1,]))

poly1 <- Polygons(list(x1),"poly1")
poly2 <- Polygons(list(x2),"poly2")
myShp <- SpatialPolygons(list(poly1,poly2),1:2)
sdf <- data.frame(ID=c(1,2))
row.names(sdf) <- c("poly1","poly2")
 myShp <- SpatialPolygonsDataFrame(myShp,data=sdf)

   #  I have been outputting my results to a list. With this small sample, it's easy to just put everything into the object county.vec. But I worry that the 24 million x 71 object would not be feasible. The non-loop version shows the output I've been getting more easily.

    COUNTY.LIST <- list()
    county.vec <- gDistance(my.pts, myShp, byid=TRUE)
    COUNTY.LIST[[1]] = apply(county.vec, 2, min)
    COUNTY.LIST[[2]] = apply(county.vec, 2, which.min)
    COUNTY.LIST[[3]] = my.pts$INDEX

# I have been putting it into a loop so that county.vec gets dumped for each version of the loop.
# Seems like this could be done using dlply perhaps? And then I would have the power of parallel processing?
idx <- unique(my.pts$MYID)
COUNTY.LIST <- list()
for(i in 1:length(idx)){
    COUNTY.LIST[[i]] <- list()
    county.vec <- gDistance(my.pts[my.pts$MYID==idx[i],], myShp, byid=TRUE)
    COUNTY.LIST[[i]][[1]] = apply(county.vec, 2, min)
    COUNTY.LIST[[i]][[2]] = apply(county.vec, 2, which.min)
    COUNTY.LIST[[i]][[3]] = my.pts$MY[my.pts$MYID==idx[i]]
    rm(county.vec)
}

dlply(my.pts,.(MYID),gDistance(my.pts, myShp, byid=TRUE),.parallel=TRUE)
> dlply(my.pts,.(MYID),gDistance(my.pts, myShp, byid=TRUE))
Error in eval.quoted(.variables, data) : 
envir must be either NULL, a list, or an environment.

# I suspect this error is because my.pts is a SpatialPointsPolygon.  I also recognize that my function call probably isn't right, but first things first.

# I tried another way to reference the MYID field, more inline with treatment of S4 objects...
dlply(my.pts,my.pts@data$MYID,gDistance(my.pts, myShp, byid=TRUE),.parallel=TRUE)

# It yields the same error.

I'd be grateful for any suggestions folks might have.

Can dlply() handle a SpatialPointsDataFrame? I don't know. — Steven
I'm guessing that it can't but I shudder to make such an assertion on SO! People always come up with such clever responses though, I wouldn't be surprised if there was a workaround or just a much more appropriate way to go about it. — Jordan

juanbretti juanbretti · Accepted Answer · 2015-08-11T06:56:09

This is an old question, but maybe my simple method helps others.
It's using parallels. I'm writing a general example. It will not run the previous data question.

set.seed(1)
#Create the clusters
library(doParallel)
cl <- makeCluster(detectCores()) 
registerDoParallel(cl)
#Export the environment variables to each cluster
clusterExport(cl,ls())
#Load the library "rgeos" to each cluster
clusterEvalQ(cl, library(rgeos))
#Split the data
ID.Split<-clusterSplit(cl,unique(poly1$ID))
#Define a function that calculates the distance of one ID in relation to the poly2
a<-function(x) gDistance(spgeom1 = poly1[x,], spgeom2 = poly2, byid=TRUE)
#Run the function in each cluster
system.time(m<-clusterApply(cl, x=ID.Split, fun=a))
#Cluster close
stopCluster(cl)
#Merge the results
output<- do.call("cbind", m)

I hope this helps.

Improve speed/use of gDistance function by using parallel processing and/or plyr/dplyr?

1 Answers