1
votes

I have two data frames and the dput of the data frames are given below:

dput(df1)

structure(list(observationtime = structure(c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), format = "h:m:s", class = "times"), observationdate = structure(c(15309, 
15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309, 
15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309, 
15309), class = "Date"), screentemperature = c(9.1, 9.4, 6.8, 
5.7, 1.6, 6.3, 5.2, 4.7, 6.8, 8.7, 6.9, 9.6, 9.4, 9.5, 8.1, 7.7, 
7.9, 8.9, 6.6, 6.8), significantweathercode = c(8L, 7L, 8L, 9L, 
15L, 12L, 12L, 15L, 15L, 9L, 15L, 7L, 2L, 7L, 12L, 8L, 12L, 8L, 
8L, 12L), latitude = c(60.139, 58.954, 57.358, 58.214, 57.725, 
57.859, 57.257, 58.288, 56.867, 57.82, 57.206, 58.454, 57.6494, 
57.712, 57.077, 56.85, 57.206, 57.698, 56.497, 55.681), longitude = c(-1.183, 
-2.9, -7.397, -6.325, -4.896, -5.636, -5.809, -4.442, -4.708, 
-3.97, -3.827, -3.089, -3.5606, -3.322, -2.836, -2.27, -2.202, 
-2.121, -6.887, -6.256)), .Names = c("observationtime", "observationdate", 
"screentemperature", "significantweathercode", "latitude", "longitude"
), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 
14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L), class = "data.frame")

dput(df2)

structure(list(time = structure(c(0.0104166666666667, 0.0590277777777778
), format = "h:m:s", class = "times"), data = structure(c(15309, 
15310), class = "Date"), latitude = c(53.674, 43.978), longitude = c(-1.222, 
-5.346)), .Names = c("time", "data", "latitude", "longitude"), row.names = c(NA, 
-2L), class = "data.frame")

I am trying to extract the temprature and significantweathercode from df2 using values from df1. For example, the dates from df2 will be matched fully to dates with df1 but times may not be the same so I need to identifiy a threshold of say 15-30 minutes to find the closest time and at the same time I need to match the coordinates as they may not match exactly and a threshold will be required to find the closest match. The steps are:

  • Match the date
  • Match the location/coordiantes, if not exact match, find the closest
  • Match the time, if not exact match, find the closest time

The solution I have is

  • Subset the data from df1 based on date
  • Select the observation with the minimum time difference and smallest distance
  • Find the difference between times and coordinates

In reality the data is large containing close to half a million rows for each of the data frames. I am trying to find out an efficient way in r to try and do such matching.

Suggestions will be welcomes with possible work arounds.

1
Much will depend on how much unique locations you have in each dataframe. - Thierry
In df1 there are 170 unique locations but in df1 there are many. We are talking about thousands. - syebill

1 Answers

0
votes

A database solution would be more efficient than a pure R solution. E.g. a PostgreSQL database with PostGIS extension would allow for spatial queries.