4 votes

I have an enormous dataset of 1.5 billion spatial lines, created from all combinations of 37,000 points. For each spatial line, I would like to extract the maximum value of the polygon (or raster - whichever is faster) that the line touches. Essentially this is a very large "spatial join" in Arc lingo. When overlaying the lines on the polygon layer, the output would be the maximum value along each spatial line for every attribute field, where each field represents one month of one year. I've also included a raster dataset, created from the January 1990 attribute of the polygon file at ~30 m resolution; the raster represents an alternative approach that I thought might save time. The polygon and raster layers cover a large spatial area: roughly 30 km x 10 km. Data are available here. The spatial lines dataset included in the .zip has only 9,900 lines, sampled at random from the full dataset of 1.5 billion lines.

First, read in the data:

#polygons

    library(maptools)  #readShapePoly, readShapeLines
    poly <- readShapePoly("ls_polys_bin", proj4string=CRS("+proj=utm +zone=21 +south +datum=WGS84 +units=m +no_defs"))
    poly$SP_ID <- NULL  #delete this extra field in prep for overlay

#raster - this represents only one month (January 1990),
   #created from the polygon layer

    library(rgdal)   #readGDAL
    library(raster)  #raster()
    raster.jan90 <- readGDAL("rast_jan90.tif")
    raster.jan90 <- raster(raster.jan90)  #convert to a RasterLayer

#lines (9,900 of the 1.5 billion included)

    lines <- readShapeLines("l_spatial", proj4string=CRS("+proj=utm +zone=21 +south +datum=WGS84 +units=m +no_defs"))

To make the line data more manageable, take a sample of 50 lines:

    lines.50 <- lines[sample(nrow(lines), 50), ]

Plot all three layers together:

    plot(raster.jan90)  #green = 1
    plot(poly, axes=TRUE, cex.axis=0.75, add=TRUE)
    plot(lines.50, col="red", add=TRUE)

First I tried an overlay, but at the current rate the entire dataset of 1.5 billion lines would take about 844 days to run on my machine:

    ptm <- proc.time()  #start clock
    overlays.all <- over(lines.50, poly, fn=max)
    ptm.sec.overlay <- proc.time() - ptm  #stop clock
    ptm.sec.overlay  #0.56 sec w/ n=12 lines; 2.3 sec w/ n=50 lines

Next I converted the polygons to a raster (one month only - January 1990) and ran extract() with the spatial lines, but this took even more time:

    ptm <- proc.time()  #start clock
    ext.rast.jan90 <- extract(raster.jan90, lines.50, fun=max, method="simple")
    ptm.sec.ext <- proc.time() - ptm  #stop clock
    ptm.sec.ext  #32 sec w/ n=12 lines; 191 sec w/ n=50 lines
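
For reference, a rough back-of-the-envelope extrapolation of both timings to the full dataset, assuming linear scaling (so the exact day counts will vary):

    #extrapolate the 50-line timings to 1.5 billion lines (rough)
    (2.3 / 50 * 1.5e9) / (3600 * 24)  #over():    ~800 days
    (191 / 50 * 1.5e9) / (3600 * 24)  #extract(): ~66,000 days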

My attempts to convert all the "0" cells to "NA" did not appear to save time. Is there another way to do this monstrous overlay or extract() more efficiently? Note that these data are currently binned as "1" or "0", but eventually I would like to run this code on a continuous variable that runs from 0 to 300.
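
For what it's worth, the 0-to-NA conversion can be done with reclassify() from the raster package; a minimal sketch, using the objects defined above:

    #set all 0 cells to NA via a two-column "is, becomes" matrix
    raster.jan90.na <- reclassify(raster.jan90, cbind(0, NA))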

All pairs of 37,000 points (not counting zero-length lines A-A) should give you only 684,481,500 lines to check, since A-B hits the same polygons as B-A. So that's 422 days... (Spacedman, 2 votes)
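
That pair count is easy to verify in R:

    choose(37000, 2)  #unordered pairs of 37,000 points
    # [1] 684481500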

2 Answers

1 vote

Here is a hack that should give a good approximation. It could probably be improved (getCrds takes a lot of time), for instance by taking larger steps (whether that is acceptable for your purposes, I do not know).

library(raster)
raster.jan90 <- raster("rast_jan90.tif") 
lines <- shapefile("l_spatial.shp", p4s="+proj=utm +zone=21 +south +datum=WGS84 +units=m +no_defs")  
lines.50 <- lines[sample(nrow(lines), 50), ]

test <- function(lns) {

  # for line i (assumed to be a single straight segment), return the
  # cell numbers of raster.jan90 that the segment passes through
  getCrds <- function(i) {
    p <- z[[i]][[1]]  # two-row coordinate matrix of line i
    # length of the segment, in cells, along each axis
    s <- (p[2,] - p[1,]) / res(raster.jan90)
    step <- round(max(abs(s)))
    if ( step < 1 ) {
        # degenerate (shorter-than-a-cell) lines; these probably
        # should not exist, but they do
        return( cbind(i, cellFromXY(raster.jan90, p[1, , drop=FALSE])) )
    }
    # sample the segment at roughly one point per cell
    x <- seq(p[1,1], p[2,1], length.out=step)
    y <- seq(p[1,2], p[2,2], length.out=step)
    cbind(i, unique(cellFromXY(raster.jan90, cbind(x, y))))
  }

  z <- coordinates(lns)
  # lapply rather than sapply, so the result is always a list of
  # matrices that rbind can stack
  crd <- lapply(1:length(z), getCrds)
  crd <- do.call(rbind, crd)

  # extract once for all cells, then take the per-line maximum
  e <- extract(raster.jan90, crd[, 2])
  tapply(e, crd[,1], max)
}

system.time(res <- test(lines.50))
#  user  system elapsed 
#  0.53    0.01    0.55 

system.time(res <- test(lines))
#  user  system elapsed 
#  59.72    0.85   60.58 

(684481500 * 60.58 / length(lines)) / (3600 * 24) is about 50 days...

Only 1 day on 50 computers
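
To make the distribution idea concrete, here is a minimal sketch of splitting the job across local cores with the parallel package (the chunk count is illustrative, mclapply forks and so will not work this way on Windows, and each worker needs access to the raster file):

library(parallel)

#split the line indices into one chunk per core and run test() on each
n.cores <- 4
idx <- splitIndices(length(lines), n.cores)
res.list <- mclapply(idx, function(i) test(lines[i, ]), mc.cores=n.cores)
res <- do.call(c, res.list)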

Note that it gets relatively more efficient with more lines (as there are relatively fewer unique cells to be queried).

1 vote

I think the speediest way to do this would be to rasterise the lines onto the same grid as your raster data.

However I wouldn't rasterise them in R. I'd write some C code that took the data for the raster and the 37,000 point locations, and then used the Bresenham line-drawing algorithm to get the raster locations of the line. Sample the raster at those locations, and do whatever you need with that data. Fast code for the Bresenham algorithm should be easily available, and you can maybe even find versions to run on GPUs for massive speedups. What's faster at drawing straight lines than a graphics card?
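
To make that concrete, here is a minimal R sketch of the Bresenham traversal (the answer suggests C for speed; the function below and the cell lookup after it are illustrative, not the poster's code):

bresenham <- function(c0, r0, c1, r1) {
  #integer Bresenham walk from cell (c0, r0) to cell (c1, r1), returning
  #the column/row of each cell visited along the segment
  dc <- abs(c1 - c0); dr <- abs(r1 - r0)
  sc <- if (c0 < c1) 1L else -1L
  sr <- if (r0 < r1) 1L else -1L
  err <- dc - dr
  cells <- matrix(NA_integer_, nrow=max(dc, dr) + 1, ncol=2)
  i <- 1
  repeat {
    cells[i, ] <- c(c0, r0)
    if (c0 == c1 && r0 == r1) break
    e2 <- 2 * err
    if (e2 > -dr) { err <- err - dr; c0 <- c0 + sc }
    if (e2 <  dc) { err <- err + dc; r0 <- r0 + sr }
    i <- i + 1
  }
  cells[1:i, , drop=FALSE]
}

#e.g., the max along one line, using raster's cell indexing
#(col0/row0/col1/row1 are the segment endpoints in cell coordinates):
#cells <- bresenham(col0, row0, col1, row1)
#max(raster.jan90[cellFromRowCol(raster.jan90, cells[, 2], cells[, 1])])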

I've made the assumption that your spatial lines are single straight line segments between two points.

Alternatively, just rent 1,000 servers from Amazon (or some other cloud provider) for half a day.