Finding closest coordinates between two large data sets

Question

I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1 based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains contains 4,500 rows (full 4,500 unique coordinates).

I have attempted to replicate the answers from similar questions on stackoverflow. for example:

R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long

Calculating the distance between points in different data frames

However these do not solve the problem in the way I want (they either join the data frames or check the distances within a single dataframe).

The solution in Find the nearest X,Y coordinate using R and related posts are the closest I have found so far.

My issue with the post is that it works out the distance between coordinates within a single dataframe, and I have been unable to understand which parameters to change in RANN::nn2 to do it across two data frames.

Proposed code that doesn't work:

library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)

Notes/Questions:

1) Which dataset should be provided to the query to find the closest value in dataset 2 to a given value in dataset 1?

2) Is there any way to avoid the problem that the datasets seem to need to be the same width (number of columns)?

3) How can the outputs (SRD_ID and distance) be added to the relevant entry in dataset 1?

4) What is the use of eps parameter in the RANN::nn2 function?

The aim is to populate the SRC_ID and distance columns in dataset 1 with the nearest station ID from dataset 2 and the distance between the entry in dataset 1 and the nearest entry in dataset 2.

Below is a table demostrating the expected results. Note: the SRC_ID and distance values are example values I have manually added myself, are almost certainly incorrect and will likely not be replicated by the code.

       id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987      52.88121     -2.873734     55      350 
2 3798045      53.80945     -2.439163     76     2100

Data:

_{r details}

platform        x86_64-w64-mingw32
version.string  R version 3.5.3 (2019-03-11)

_{data set 1 input (not narrowed down to unique coordinates)}

structure(list(id = c(1L, 2L, 4L, 5L, 
6L, 7L, 8L, 9, 10L, 3L), 
    HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529, 
    63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207, 
    78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822, 
    -2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454, 
    -5.162345043546359, -8.63311254098095, 3.813289888829932, 
    -3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")

_{data set 2 input}

structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062, 
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA, 
10L), class = "data.frame")

If they are lat and long you need to think about using geosphere::distHaversine to get the right distance. But finding the closest points works as suggested in the answers. — M--

M-- M-- · Accepted Answer · 2019-04-18T19:36:44

I wrote up an answer referencing this thread. The function is modified to take care of reporting the distance and avoid hard-coding. Please note that it calculates Euclidean distance.

library(data.table)
#Euclidean distance 
mydist <- function(a, b, df1, x, y){

          dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))

          return(data.table(Closest.V1  = which.min(dt$V1),
                            Distance    = dt[which.min(dt$V1)]))
           }

setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2), 
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"), 
                         by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]

  #     id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
  # 1:   1      52.88144     -2.873778          5   0.7990743
  # 2:   2      57.80945     -2.234544          8   2.1676868
  # 3:   4      34.02335     -3.098445         10   1.4758202
  # 4:   5      63.80879     -2.439163          3   4.2415854
  # 5:   6      53.68881     -7.396112          2   3.6445416
  # 6:   7      63.44628     -5.162345          3   2.3577811
  # 7:   8      21.60755     -8.633113          9   8.2123762
  # 8:   9      78.32444      3.813290          7  11.4936496
  # 9:  10      66.85533     -3.994326          1   1.9296370
  # 10:  3      51.62354     -8.906553          2   3.2180026

You can use RANN::nn2, but you need to make sure to use the right syntax. Following works!

as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))

#    nn.idx   nn.dists
# 1       5  0.7990743
# 2       8  2.1676868
# 3      10  1.4758202
# 4       3  4.2415854
# 5       2  3.6445416
# 6       3  2.3577811
# 7       9  8.2123762
# 8       7 11.4936496
# 9       1  1.9296370
# 10      2  3.2180026

Finding closest coordinates between two large data sets

Data:

2 Answers