I have two data frames and the dput of the data frames are given below:
dput(df1)
structure(list(observationtime = structure(c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), format = "h:m:s", class = "times"), observationdate = structure(c(15309,
15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309,
15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309, 15309,
15309), class = "Date"), screentemperature = c(9.1, 9.4, 6.8,
5.7, 1.6, 6.3, 5.2, 4.7, 6.8, 8.7, 6.9, 9.6, 9.4, 9.5, 8.1, 7.7,
7.9, 8.9, 6.6, 6.8), significantweathercode = c(8L, 7L, 8L, 9L,
15L, 12L, 12L, 15L, 15L, 9L, 15L, 7L, 2L, 7L, 12L, 8L, 12L, 8L,
8L, 12L), latitude = c(60.139, 58.954, 57.358, 58.214, 57.725,
57.859, 57.257, 58.288, 56.867, 57.82, 57.206, 58.454, 57.6494,
57.712, 57.077, 56.85, 57.206, 57.698, 56.497, 55.681), longitude = c(-1.183,
-2.9, -7.397, -6.325, -4.896, -5.636, -5.809, -4.442, -4.708,
-3.97, -3.827, -3.089, -3.5606, -3.322, -2.836, -2.27, -2.202,
-2.121, -6.887, -6.256)), .Names = c("observationtime", "observationdate",
"screentemperature", "significantweathercode", "latitude", "longitude"
), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L,
14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L), class = "data.frame")
dput(df2)
structure(list(time = structure(c(0.0104166666666667, 0.0590277777777778
), format = "h:m:s", class = "times"), data = structure(c(15309,
15310), class = "Date"), latitude = c(53.674, 43.978), longitude = c(-1.222,
-5.346)), .Names = c("time", "data", "latitude", "longitude"), row.names = c(NA,
-2L), class = "data.frame")
I am trying to extract the temprature and significantweathercode from df2 using values from df1. For example, the dates from df2 will be matched fully to dates with df1 but times may not be the same so I need to identifiy a threshold of say 15-30 minutes to find the closest time and at the same time I need to match the coordinates as they may not match exactly and a threshold will be required to find the closest match. The steps are:
- Match the date
- Match the location/coordiantes, if not exact match, find the closest
- Match the time, if not exact match, find the closest time
The solution I have is
- Subset the data from df1 based on date
- Select the observation with the minimum time difference and smallest distance
- Find the difference between times and coordinates
In reality the data is large containing close to half a million rows for each of the data frames. I am trying to find out an efficient way in r to try and do such matching.
Suggestions will be welcomes with possible work arounds.