0
votes

I have two data frames of different sizes containing geocodes. The first (df) has 12,000 observations and the second (schools) 3,000.

The first contains geocodes for properties in a country and the second for schools in the country.

I want to find the distance of the nearest school for each property. Using the geosphere package I'm currently working with the following:

library(geosphere)
for(i in 1:length(df$longitude)){
  df$dist2[i] <- distm(c(schools[1, 3], schools[1, 2]), c(df$longitude[i], df$latitude[i]), fun = distHaversine)  *0.001
}

where schools[, 3] and schools[, 2] are the longitude and latitude columns of that data frame respectively.

The above calculates the distance (in km) between all observations in df and the first school in schools.

I want to calculate the distance between each observation and all schools, saving only the smallest distance as that value for df$dist2[i].

2

2 Answers

1
votes

In the following example, I make up the longitude/latitude data on the points and the school.

library(tidyverse)
library(geosphere)

df_points  <- data.frame(lon = rnorm(10, mean =4, sd = 0.5), lat = rnorm(10, mean = 50, sd= 0.1))
df_schools <- data.frame(lon = rnorm( 3, mean =4, sd = 0.5), lat = rnorm( 3, mean = 50, sd= 0.1))

distm(df_points, df_schools, fun = distHaversine ) %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "point_id") %>% 
  mutate(point_id = as.numeric(point_id)) %>%
  gather(key = school, value = distance, -point_id) %>%
  group_by(point_id) %>% 
  summarise(smalles_distance = min(distance))
1
votes

Here is an approach using sp class objects. You can coerce data.frame objects to SpatialPointsDataFrame objects using something along the lines of: coordinates(x) <- ~lon+lat The idea here is to derive a distance matrix between the two point feature classes and then pull the distance and ID based on the column name (assigned from the school data). This not only returns the distance but also a unique identifier for each school making it easy to query the actual closest school to any given property feature.

First, add the required libraries and create some example data.

library(sp)
library(raster)

e <- as(raster::extent(-180, 180, -90, 90), "SpatialPolygons")
  properties <- spsample(e, 1000, type="random")
    proj4string(properties) <- "+proj=longlat +ellps=WGS84"  
  schools <- spsample(e, 100, type="random")
    proj4string(schools) <- "+proj=longlat +ellps=WGS84"
      schools$ids <- paste0("school", 1:length(schools))   

Now, we can create the distance matrix, assign the diagonal to NA and add a unique identifier from schools to the column names of the matrix.

d <- spDists(x = properties, y = schools, longlat = TRUE)
  diag(d) <- NA
    colnames(d) <- schools$ids

There are certainly more elegant ways to do this but, for simplicity sake we will use a for loop to populate two vectors representing distance and ID's. We use which.min to pull the index for the minimum distance at row i. The iterator is based on the matrix rows because they represent the property features.

sdist <- rep(NA, nrow(d))
sid <- rep(NA, nrow(d))
  for(i in 1:nrow(d)) {
    srow <- d[i,]  
    sdist[i] <- srow[which.min(srow)]
    sid[i] <- names(srow)[which.min(srow)]
  }

We can then assign the resulting vectors to the properties SpatialPointsDataFrame. Now we have columns in the @data slot data.frame that represent distance to nearest school as well as the school ID's.

properties$school <- sid
properties$dist <- sdist

Here we can plot the results.

par(mfrow=c(2,1))  
  plot(properties, pch=19, cex=0.5)
    plot(schools, pch=19, col="red", add=TRUE)
      plot(e, add=TRUE)
        title("random properties (black) and schools (red)", cex=0.5) 
  plot(properties, col="white")
    plot(properties[1,], pch=19, cex=2, add=TRUE)
      plot(schools[which(schools$ids %in% properties[1,]$school),], 
         pch=19, cex=2, col="red", add=TRUE)
           plot(e, add=TRUE)
           title("Property 1 (black) and closest school (red)", cex=0.5)
     sidx <- which(schools$ids %in% properties[1,]$school)
       text(coordinates(schools[sidx,]), 
            label = schools[sidx,]$ids, col="blue", cex=1)