0
votes

I have two unequally sized data frames called OBS and REF. Both contain 9864 columns - columns 1 and 2 are the longitude and latitude, and columns 3-9864 are daily temperature values for 27 years. They have vastly different numbers of rows - OBS has 12375 and REF has just 504.

OBS: 12375 obs. of 9864 variables

Lon     Lat     1979.01.01     1979.01.02     1979.01.03
0.000   39.75   13.69          13.14          13.32
1.000   39.75   12.93          12.41          12.59
2.000   39.75   11.78          10.62          11.15
3.000   39.75   11.73          10.94          12.16

REF: 504 obs. of 9864 variables

Lon      Lat       1979.01.01     1979.01.02     1979.01.03
0.0000   37.6559   12.69          12.35          12.60
2.8125   37.6559   13.43          12.97          13.23
5.6250   37.6559   13.91          13.64          13.71
8.4375   37.6559   14.12          14.24          14.01

What I want to do is output a new data frame (REF_Closest) that takes the coordinates from the OBS data frame, finds the closest coordinates from the REF data frame, and outputs these coordinates and all the variables to the right of this, e.g.:

REF_Closest: 12375 obs. of 9864 variables

Lon      Lat       1979.01.01     1979.01.02     1979.01.03
0.0000   37.6559   12.69          12.35          12.60
0.0000   37.6559   12.69          12.35          12.60
2.8125   37.6559   13.43          12.97          13.23
2.8125   37.6559   13.43          12.97          13.23

I have tried adapting other similar questions but cannot work this out. Any suggestions?

2
I don't understand from your question why REF_Closest has 9864 rows, or why the first two rows are repeated? - Stephen Henderson
REF_Closest has 9864 columns rather than rows. The first two rows are repeated because the closest coordinates to rows 1 and 2 of the OBS data frame are from row 1 from the REF data frame. The closest coordinates to rows 3 and 4 of the OBS data frame are from row 3 from the REF data frame etc. - DJ-AFC
OK if I understand you correctly this probably has a solution: stackoverflow.com/q/27321856/1527403 - Stephen Henderson

2 Answers

1
votes

I came up with a data.table solution:

library(data.table)

# we will use this dummy variable to group by
ref[, id := 1:.N]

ref[, cbind(LonRef = Lon, 
      LatRef = Lat, 
      obs[which.min(as.matrix(dist(rbind(ref[.GRP, 1:2], 
                                         obs[, 1:2])))[2:(.N+1), 1]), ]), 
      by = id]

What is in there:

  • cbind binds the columns of three different sources:
    • Ref's lat
    • Ref's lon
    • and a data.frame (data.table too) with the observed readings.
  • That data.frame is calculated as follows:
    • Calculate the ecuclidean distance with dist from ref's current row (hence the .GRP) to all lat-lon pairs in obs (hence the rbind).
    • identify the minimum of the first column, excluding itself (henche the (2:.N))
    • Return the full row of obs for such observed minimum distance (that's the obs[...)

After which I obtained:

   id LonRef  LatRef Lon   Lat 1979.01.01 1979.01.02 1979.01.03
1:  1 0.0000 37.6559   0 39.75      13.69      13.14      13.32
2:  2 2.8125 37.6559   3 39.75      11.73      10.94      12.16
3:  3 5.6250 37.6559   3 39.75      11.73      10.94      12.16
4:  4 8.4375 37.6559   3 39.75      11.73      10.94      12.16

I used data:

obs <- fread("Lon     Lat     1979.01.01     1979.01.02     1979.01.03
 0.000   39.75   13.69          13.14          13.32
 1.000   39.75   12.93          12.41          12.59
 2.000   39.75   11.78          10.62          11.15
 3.000   39.75   11.73          10.94          12.16")

ref <- fread("Lon      Lat       1979.01.01     1979.01.02     1979.01.03
 0.0000   37.6559   12.69          12.35          12.60
 2.8125   37.6559   13.43          12.97          13.23
 5.6250   37.6559   13.91          13.64          13.71
+ 8.4375   37.6559   14.12          14.24          14.01")
0
votes

With a slight alteration - adding an extra row = this will give you the row numbers in OBS that have closest (euclidean) distance to each row of REF (i.e. its redundant).

> REF
     Lon     Lat X1979.01.01 X1979.01.02 X1979.01.03
1 0.0000 37.6559       12.69       12.35       12.60
2 2.8125 37.6559       13.43       12.97       13.23
3 5.6250 37.6559       13.91       13.64       13.71
4 8.4375 37.6559       14.12       14.24       14.01
> OBS
  Lon   Lat X1979.01.01 X1979.01.02 X1979.01.03
1   0 39.75       13.69       13.14       13.32
2   1 39.75       12.93       12.41       12.59
3   2 39.75       11.78       10.62       11.15
4   3 39.75       11.73       10.94       12.16
5   8 38.50       12.34       14.23       17.23


nearest.rows <- apply (OBS,1, function(OBSrow) 
                       which.min(sqrt((OBSrow[1] - REF$Lon)^2 + (OBSrow[2] - REF$Lat)^2))
                      )

> nearest.rows
[1] 1 1 2 2 4