I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1, based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 coordinates unique).
I have attempted to replicate the answers from similar questions on Stack Overflow, for example:
Calculating the distance between points in different data frames
However, these do not solve the problem in the way I want (they either join the data frames or check the distances within a single data frame).
The solutions in "Find the nearest X,Y coordinate using R" and related posts are the closest I have found so far.
My issue with that post is that it works out the distance between coordinates within a single data frame, and I have been unable to work out which parameters to change in RANN::nn2 to do it across two data frames.
Proposed code that doesn't work:
library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)
Notes/Questions:
1) Which dataset should be provided to the query to find the closest value in dataset 2 to a given value in dataset 1?
2) Is there any way to avoid the problem that the datasets seem to need to be the same width (number of columns)?
3) How can the outputs (SRC_ID and distance) be added to the relevant entry in dataset 1?
4) What is the use of the eps parameter in the RANN::nn2 function?
The aim is to populate the SRC_ID and distance columns in dataset 1 with the nearest station ID from dataset 2 and the distance between the entry in dataset 1 and the nearest entry in dataset 2.
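For illustration, here is a minimal sketch of one way the cross-data-frame lookup might be written with RANN::nn2. It assumes the data frame being searched (dataset 2) goes in data and the points to look up (dataset 1) go in query, and that only the two coordinate columns are passed so both inputs have the same width. The three-row data frames and the coord_cols helper are illustrative stand-ins, not the real data. The distances nn2 returns are Euclidean distances in degrees, not metres, and eps is left at its default of 0, which the RANN documentation describes as an exact search.

```r
library(RANN)

# Small illustrative stand-ins for the real data frames
dataset1 <- data.frame(
  id            = c(1L, 2L, 3L),
  HIGH_PRCN_LAT = c(52.88, 57.81, 34.02),
  HIGH_PRCN_LON = c(-2.87, -2.23, -3.10)
)
dataset2 <- data.frame(
  SRC_ID        = c(55L, 54L, 61114L),
  HIGH_PRCN_LAT = c(68.47, 50.34, 34.33),
  HIGH_PRCN_LON = c(-5.06, -5.96, -1.66)
)

# Hypothetical helper: the columns both data frames share
coord_cols <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")

# data  = the points to search in (dataset 2)
# query = the points to find a neighbour for (dataset 1)
# k = 1 returns only the single nearest neighbour
nn <- nn2(data  = dataset2[, coord_cols],
          query = dataset1[, coord_cols],
          k     = 1)

# nn$nn.idx holds row numbers of dataset2, one row per row of dataset1
dataset1$SRC_ID   <- dataset2$SRC_ID[nn$nn.idx[, 1]]
# Euclidean distance in degrees, not a ground distance
dataset1$distance <- nn$nn.dists[, 1]
```

Because dataset 1 has only 1,800 unique coordinates, the same call could be run on the de-duplicated coordinates and the result merged back, though the sketch above does not do that.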
Below is a table demonstrating the expected results. Note: the SRC_ID and distance values are example values that I added manually; they are almost certainly incorrect and will likely not be replicated by the code.
  id      HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987 52.88121      -2.873734     55     350
2 3798045 53.80945      -2.439163     76     2100
Data:
R details
platform x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
data set 1 input (not narrowed down to unique coordinates)
structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
data set 2 input
structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
geosphere::distHaversine to get the right distance. But finding the closest points works as suggested in the answers. – M--
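The haversine distance mentioned in the comment above can be sketched in base R as follows. This is a brute-force illustration that compares every row of dataset 1 against every row of dataset 2, not the RANN or geosphere approach itself. The haversine_m function, the Earth radius of 6378137 m, and the three-row data frames are all assumptions made for the example.

```r
# Haversine great-circle distance in metres; inputs in decimal degrees.
# r is an assumed Earth radius in metres.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Small illustrative stand-ins for the real data frames
dataset1 <- data.frame(
  id            = c(1L, 2L, 3L),
  HIGH_PRCN_LAT = c(52.88, 57.81, 34.02),
  HIGH_PRCN_LON = c(-2.87, -2.23, -3.10)
)
dataset2 <- data.frame(
  SRC_ID        = c(55L, 54L, 61114L),
  HIGH_PRCN_LAT = c(68.47, 50.34, 34.33),
  HIGH_PRCN_LON = c(-5.06, -5.96, -1.66)
)

# For each row of dataset1, measure the distance to every row of
# dataset2 and keep the index and distance of the closest one
nearest <- t(sapply(seq_len(nrow(dataset1)), function(i) {
  d <- haversine_m(dataset1$HIGH_PRCN_LAT[i], dataset1$HIGH_PRCN_LON[i],
                   dataset2$HIGH_PRCN_LAT, dataset2$HIGH_PRCN_LON)
  j <- which.min(d)
  c(idx = j, dist = d[j])
}))

dataset1$SRC_ID   <- dataset2$SRC_ID[nearest[, "idx"]]
dataset1$distance <- nearest[, "dist"]   # metres
```

With 1,800 unique coordinates against 4,500 stations this is about 8 million distance evaluations, which base R should handle, but running it on the unique coordinates first and merging back would avoid repeating work across the 180,000 rows.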