How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table

Question

I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:

1) How to efficiently calculate distance between pair of coordinates using data.table :=

2) Using a simple for loop on spatial data

3) Calculate distance between two latitude-longitude points? (Haversine formula)

4) Fastest way to determine COUNTRY from millions of GPS coordinates [R]

I was able to write code that works, but it is (too) slow. Can someone help me with a faster data.table() implementation, or point out where the bottleneck in my code is? Is it the spDistsN1() function, or maybe the combination of apply() and melt()? I am most comfortable in R, but open to other software (as long as it is open source).

Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.

# packages:
library(data.table)
library(tidyverse)
library(sp)


# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata<-as.data.frame(cbind(id=1:number_of_GPS_coordinates, 
                             lat=runif(number_of_GPS_coordinates,50.5,53.5), 
                             lon=runif(number_of_GPS_coordinates,4,7)))

# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))

names(busdata.data) <- c("latitude_bustops",  "longitude_bustops", "name")

You can download the real bus stop data if you want; it is kind of hard to mimic with a random sample.

#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#gzfile(temp, 'rt')
#busstopdata <- read.csv(temp, stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
#  mutate(latitude_bustops = latitude)%>%
#  mutate(longitude_bustops = longitude)%>%
#  dplyr::select(name, latitude_bustops,  longitude_bustops)

This is the code I use now to calculate distances. It works, but it is pretty slow.

countDataPoints3 <- function(p) {
  distances <- spDistsN1(data.matrix(gpsdata[,c("lon","lat")]), 
                         p,
                         longlat=TRUE) # in km
  return(which(distances <= .2)) # distance is now set to 200 meters
}


# code to check per data point if a bus stop is near and save this per bus stop in a list entry
datapoints.by.bustation       <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)


# rename list entries
names(datapoints.by.bustation) <- busdata.data$name

# melt list into one big data.frame
# (the list method of melt() comes from reshape2; it returns columns `value` and `L1`)
long.data.frame.busstops       <- reshape2::melt(datapoints.by.bustation)

# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := 1:nrow(gpsdata)]
setkey(gpsdata, key = "rowID")
setDT(long.data.frame.busstops)

# merge the data, and filter non-unique entries 
setkey(long.data.frame.busstops, key = "value")
GPS.joined        <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by="id") # keep one row per GPS id

# this last part of the code is needed to make sure that if there are more than 1 bus stop nearby it puts these bus stop in a list
# instead of adding row and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.

# makes sense:
as_tibble(GPS.joined.unique2) %>%
  summarize(sum = sum(nearby))

You have a problem of scale here. You are creating a matrix of millions by roughly 5 thousand, i.e. billions of distances, yet you are only interested in distances within 200 meters. This problem is ripe for a divide and conquer approach. Dividing your GPS coordinates and stops into smaller areas would reduce the number of calculations needed and improve performance. — Dave2e
@SymbolixAU. Thank you for the link. I had already seen that page. I tried that approach, but it is not really applicable in my situation, since I am not calculating the distance between (consecutive) data points in one data.table but comparing all data points in one data table with all data in another table. Which, as Dave2e points out, leads to the explosion of the size of the data set and thus the computation time. How would you recommend splitting by smaller areas, considering there is no ID that links the bus stops to a specific area? — L Smeets
@Dave2e, How would you recommend splitting by smaller areas, considering there is no ID that links the bus stops to a specific area? — L Smeets
I would use the cut function with lat and long to make a grid of GPS points. Then, as you step through each cell in the grid, filter the stop locations to only those within the cell's lat & long (plus a little buffer on each side) and use your above algorithm. The big question is the grid size: make it too big and there are no benefits, too small and there are too many cells. I would try keeping the number of stops per grid cell to <20-30 (maybe). Good luck. — Dave2e
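
A minimal sketch of the grid idea from the comment above, under stated assumptions: `near_stops_grid`, `haversine_km`, and the cell size of 0.01 degrees are hypothetical names and values chosen for illustration (0.01 degrees is wider than 200 m in both directions at Dutch latitudes), and integer cell indices from `floor()` are used here instead of `cut()`. Each stop is registered in its own cell and the eight neighbouring cells, so one equi-join on the cell index finds all candidate pairs, and the distance is computed only for those candidates.

```r
library(data.table)

# Haversine distance in km (vectorised); an Earth radius of 6371 km is assumed
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: all (GPS point, stop) pairs within radius_km.
# `cell` (in degrees) must be at least as wide as radius_km in both directions.
near_stops_grid <- function(gps, stops, radius_km = 0.2, cell = 0.01) {
  gps   <- as.data.table(gps)
  stops <- as.data.table(stops)

  # Integer grid-cell index of every GPS point
  g <- gps[, .(id, gps_lat = lat, gps_lon = lon,
               cx = as.integer(floor(lon / cell)),
               cy = as.integer(floor(lat / cell)))]

  # Register every stop in its own cell and the 8 neighbouring cells
  offsets <- CJ(dx = -1:1, dy = -1:1)
  s  <- stops[, .(name, bus_lat = latitude_bustops, bus_lon = longitude_bustops)]
  s9 <- s[rep(seq_len(nrow(s)), each = nrow(offsets))]
  s9[, dx := rep(offsets$dx, times = nrow(s))]
  s9[, dy := rep(offsets$dy, times = nrow(s))]
  s9[, cx := as.integer(floor(bus_lon / cell)) + dx]
  s9[, cy := as.integer(floor(bus_lat / cell)) + dy]

  # Candidate pairs share a cell index; distances are computed only for these
  cand <- s9[g, on = .(cx, cy), nomatch = 0L, allow.cartesian = TRUE]
  cand[, distance := haversine_km(gps_lat, gps_lon, bus_lat, bus_lon)]
  cand[distance <= radius_km, .(id, name, distance)][order(id, distance)]
}

# Small made-up example in the same column layout as the question's data
set.seed(1)
gps_demo   <- data.frame(id = 1:500,
                         lat = runif(500, 52, 52.1),
                         lon = runif(500, 5, 5.1))
stops_demo <- data.frame(latitude_bustops  = runif(50, 52, 52.1),
                         longitude_bustops = runif(50, 5, 5.1),
                         name = paste0("S", 1:50))
res_demo <- near_stops_grid(gps_demo, stops_demo)
```

With the question's objects this would be called as `near_stops_grid(gpsdata, busdata.data)`. A smaller `cell` means fewer candidate pairs per join, as long as it stays larger than the search radius.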

Hugh · Accepted Answer · 2018-11-10T13:42:24

Consider a slicing method: first cut by close latitudes, then by close longitudes. In this case I use 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support of rolling joins.

The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.
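
To make the `roll` argument concrete, here is a toy sketch with made-up tables (`stops_toy` and `gps_toy` are hypothetical and not part of the data above). With a positive `roll`, each row of the inner table is matched to the single row of the outer table with the largest key value at or below it, provided the gap is no more than `roll`; so a rolling join returns at most one stop per GPS row.

```r
library(data.table)

# Made-up one-dimensional example: stops and GPS points keyed by latitude only
stops_toy <- data.table(lat = c(52.0, 52.4, 53.5), name = c("A", "B", "C"))
gps_toy   <- data.table(lat = c(52.1, 52.45, 52.95, 51.8), id = 1:4)

# For each GPS row, take the last stop at or below its latitude,
# but only if that stop is at most 0.5 degrees away
rolled <- stops_toy[gps_toy, on = "lat", roll = 0.5, nomatch = 0L]

# id 1 (52.10) rolls to stop A (gap 0.10)
# id 2 (52.45) rolls to stop B (gap 0.05); A is also within 0.5 but is not returned
# id 3 (52.95) has no match: the nearest stop below is B, 0.55 degrees away
# id 4 (51.80) has no match: it lies below the first stop
rolled
```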

library(data.table)
library(hutils)
library(magrittr) # provides the %>% pipe used below
setDT(gpsdata)
setDT(busdata.data)

gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)

setkey(gpsdata, lat)

# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]

busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]


setkey(busdata.data, lat)

gpsID_by_lat <- 
  gpsdata[, .(id), keyby = "lat"]


By_latitude <- 
  busdata.data[gpsdata, 
               on = "lat",

               # within 0.5 degrees of latitude
               roll = 0.5, 
               # +/-
               rollends = c(TRUE, TRUE),

               # and remove those beyond 0.5 degrees
               nomatch=0L] %>%
  .[, .(id_lat = id,
        name_lat = name,
        bus_lat = latitude_bustops,
        bus_lon = longitude_bustops,
        gps_lat,
        gps_lon),
    keyby = .(lon = gps_lon)]

setkey(busdata.data, lon)

By_latlon <-
  busdata.data[By_latitude,
               on = c("name==name_lat", "lon"),

               # within 0.5 degrees of longitude
               roll = 0.5, 
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch=0L]

By_latlon[, distance := haversine_distance(lat1 = gps_lat, 
                                           lon1 = gps_lon,
                                           lat2 = bus_lat,
                                           lon2 = bus_lon)]

By_latlon[distance < 0.2]
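
The question asked for one row per GPS point with a flag for whether a stop is nearby. A possible way to get there from a table of matches is sketched below; `matches_toy` and `gps_ids` are made-up stand-ins (the column names `id`, `name` and `distance` are assumptions about the shape of the result), so the column selection would need adapting to the real output.

```r
library(data.table)

# Made-up stand-ins: five GPS ids and three (id, stop, distance in km) matches
gps_ids     <- data.table(id = 1:5)
matches_toy <- data.table(id       = c(2L, 2L, 4L),
                          name     = c("A", "B", "C"),
                          distance = c(0.05, 0.15, 0.10))

# Summarise the matches per GPS point: how many stops, and which is nearest
per_point <- matches_toy[distance < 0.2,
                         .(n_stops = .N, nearest = name[which.min(distance)]),
                         by = id]

# Join back so every GPS point keeps exactly one row
flagged <- per_point[gps_ids, on = "id"]
flagged[is.na(n_stops), n_stops := 0L]
flagged[, nearby := n_stops > 0L]
flagged
```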
