Calculate all distances between two set of points using st_distance

Question

I have two sets of points stored in R as sf objects. Point object x contains 204,467 and point y contains 5,297 points.

In theory, I would want to calculate the distance from all points in x to all points in y. I understand that this would create a beast of a matrix, but it is doable using st_distance(x, y, by_element=FALSE) in the sf package in about 40 minutes on my i7 desktop.

What I want to do is to calculate the distance from all of the points in x to all of the points in y, then I want to convert this into a data.frame, that contains all variables for the respective x and y pair of points. This is because I want flexibility in terms of aggregation using dplyr, for instance, I want to find the number of points in y, that is within 10, 50, 100 km from x, and where x$year < y$year.

I successfully created the distance matrix, which has around 1,083,061,699 cells. I know this is a very inefficient way of doing this, but it gives flexibility in terms of aggregation. Other suggestions are welcome.

Below is code to create two sf point objects, and measure the distance between them. Next, I would want to convert this into a data.frame with all variables from x and y, but this is where I fail to proceed.

If my suggested workflow is unfeasible, can someone provide an alternative solution to measure distance to all points within a predefined radius, and create a data.frame of the result with all variables from x and y?

# Create two sf point objects 
set.seed(123)
library(sf)


pts1 <- st_as_sf(x = data.frame(id=seq(1,204467,1),
                                year=sample(seq(from = 1990, to = 2018, by = 1), size = 204467, replace = TRUE),
                                xcoord=sample(seq(from = -180, to = 180, by = 1), size = 204467, replace = TRUE),
                                ycoord=sample(seq(from = -90, to = 90, by = 1), size = 204467, replace = TRUE)),
                 coords=c("xcoord","ycoord"),crs=4326)

pts2 <- st_as_sf(x = data.frame(id=seq(1,5297,1),
                                year=sample(seq(from = 1990, to = 2018, by = 1), size = 5297, replace = TRUE),
                                xcoord=sample(seq(from = -180, to = 180, by = 1), size = 5297, replace = TRUE),
                                ycoord=sample(seq(from = -90, to = 90, by = 1), size = 5297, replace = TRUE)),
                 coords=c("xcoord","ycoord"),crs=4326)

distmat <- st_distance(pts1,pts2,by_element = FALSE)

dww dww · Accepted Answer · 2018-10-19T00:35:46

I would consider approaching this differently. Once you have your distmat matrix, you can do the types of calculation you describe without needing a data.frame. You can use standard subsetting to find which points meet your specified criteria.

For example, to find the combinations of points where pts1$year is greater than pts2$year we can do:

subset_points = outer(pts1$year, pts2$year, `>`)

Then, to find how many of these are separated more than 100 km, we can do

library(units)
sum(distmat[subset_points] > (100 * as_units('km', 1)))

A note on memory usage

However you approach this with sf or data.frame objects, the chances are that you will start to bump up against RAM limits with 1e9 floating points in each matrix or column of a data.table. You might think about instead converting your distance matrix to a raster. Then the raster can be stored on disk rather than in memory, and you can utilise the memory-safe functions in the raster package to crunch your way through.

How we might use rasters to work from disk and save RAM

We can use memory-safe raster operations for the very large matrices like this, for example:

library(raster)

# convert our matrices to rasters, so we can work on them from disk
r = raster(matrix(as.numeric(distmat), length(pts1$id), length(pts2$id)))
s = raster(subset_points)
remove('distmat', 'subset_points')

# now create a raster equal to r, but with zeroes in the cells we wish to exclude from calculation
rs = overlay(r,s,fun=function(x,y){x*y}, filename='out1.tif')     

# find which cells have value greater than x (1e6 in the example)
Big_cells = reclassify(rs, matrix(c(-Inf, 1e6, 0, 1e6, Inf, 1), ncol=3, byrow=TRUE), 'out.tiff', overwrite=T)

# and finally count the cells
N = cellStats(Big_cells, sum)

Calculate all distances between two set of points using st_distance

1 Answers