I have an enormous dataset of 1.5 billion spatial lines that I've created using all combinations of 37,000 points. For each spatial line, I would like to extract the maximum value of the polygon (or raster - whatever's faster) that the line touches. Essentially this is a very large "spatial join" in Arc lingo. If overlaying lines on the polygon layer, output would be the maximum value of the spatial line at all attribute fields - each representing one month of one year. I've included a raster dataset as well, which was created from only January 1990 of the polygon file at ~30m resolution - the raster represents an alternative approach that I thought might save time. The polygon & raster layers represent a large spatial area: roughly 30km x 10km. Data are available here. The spatial lines dataset I've included in the .zip has only 9900 lines, sampled at random from the whole dataset of 1.5 billion lines.
First read in the data
#polygons
poly<-readShapePoly("ls_polys_bin",proj4string=CRS("+proj=utm +zone=21 +south +datum=WGS84 +units=m +no_defs"))
poly$SP_ID<-NULL #deleting this extra field in prep for overlay
#raster - this represents only one month (january 1990)
#raster created from polygon layer but one month only
raster.jan90<-readGDAL("rast_jan90.tif")
raster.jan90<-raster(raster.jan90) #makes it into a raster
#lines (9900 of 1.5 billion included)
lines<-readShapeLines("l_spatial",proj4string=CRS("+proj=utm +zone=21 +south +datum=WGS84 +units=m +no_defs"))
To make the line data more manageable, take sample of 50 lines
lines.50<-lines[sample(nrow(lines),50),]
Plot all three layers together
plot(raster.jan90)#where green=1
plot(poly, axes=T,cex.axis=0.75, add=T)
plot(lines.50, col="red", add=TRUE)
First I tried an overlay, but at current rate, the entire dataset of 1.5 billion would take about 844 days to run on my machine
ptm <- proc.time() #start clock
overlays.all<-over(lines.50,poly, fn=max)
ptm.sec.overlay<-proc.time() - ptm # stop clock
ptm.sec.overlay #.56 sec w/ n=12 lines; 2.3 sec w/ 50 lines
Next I converted the polygons to a raster (one month only - Jan 1990), and I ran an extract() with the spatial lines, but this took even more time.
ptm <- proc.time() # Start clock
ext.rast.jan90<-extract(raster.jan90,lines.50, fun=max, method=simple)
ptm.sec.ext<-proc.time() - ptm # stop clock
ptm.sec.ext #32 sec w/ n=12 lines; 191 sec w/ n=50 lines
My attempts to convert all the "0" cells to "NA" did not appear to save time. Is there another way to do this monstrous overlay or extract() more efficiently? Note that these data are currently binned as "1" or "0", but eventually I would like to run this code for a continuous variable that runs 0:300.