EDIT 30.09.2019: Added examples and modified code for reproducibility.
EDIT 01.10.2019: Better Explanation.
I'm merging two different datasets of a transportation system. One (df1) is the schedules at which trains should be at a certain station, the other (df2) is the observations of trains passing by those stations.
Since it's common that some schedules can't be fulfilled (a train broke down, services were cancelled, construction sites, etc.), the observations dataset contains fewer entries than the schedules one.
My goal is to measure the delays of the realized schedules from a passenger perspective. That means: if a passenger knows at what time the train should depart a station, the delay they experience is either zero (the train leaves at the exact scheduled time) or the wait until the immediate next train (delay = depart_observed - depart_scheduled).
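As a concrete illustration of that delay definition (made-up timestamps, not from the real data):

```r
# Hypothetical timestamps illustrating the delay definition above
scheduled <- as.POSIXct("2019-09-09 06:55:00", tz = "UTC")
observed  <- as.POSIXct("2019-09-09 06:57:30", tz = "UTC")
difftime(observed, scheduled, units = "secs")  # Time difference of 150 secs
```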
The goal is translated into the following task:
merge both datasets by:
- stop_id: since I need to match observations for each station, this is obviously the first matching criterion.
- timestamp: for every schedule timestamp I should find the immediately consecutive observation timestamp.
The result should show all possible schedules (unique, no repetitions) matched with the closest following observation (also unique, no repetitions). Again, I need to make sure that the merge pairs each schedule's timestamp only with observations that happen AFTER or SIMULTANEOUSLY with it. Given the size difference between the two datasets, I'm expecting plenty of NAs in the observed column, since many schedules can't be paired with an observation.
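For reference, here is a minimal, self-contained sketch (toy data, not from the real feed) of how a backward rolling join in data.table expresses "closest following observation":

```r
library(data.table)

# Toy data: two schedules and one observation at a single stop
sched <- data.table(stop_id = "S1",
                    departure = as.POSIXct(c("2019-09-09 06:00:00",
                                             "2019-09-09 06:10:00"), tz = "UTC"))
obs   <- data.table(stop_id = "S1",
                    departure = as.POSIXct("2019-09-09 06:02:00", tz = "UTC"))

setkey(obs, stop_id, departure)
obs[, departObserved := departure]  # the join overwrites 'departure' with i's values

# A negative roll (NOCB) pairs each schedule with the first observation
# at or after it; the 06:10 schedule has no later observation, so it gets NA
obs[sched, roll = -Inf]
```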
For simplicity, the provided examples contain only the two columns I'm using for the join:
stop_id (id for the station) and timestamp (date & time at which the trains should be or were observed at the stations).
The approach I'm using is a rolling join with data.table in R. This works great, except that whenever I perform the join, I keep getting duplicates from one of the datasets, and I can't have those repetitions in the merged set.
The code I'm using:
# DECLARING FUNCTIONS (two functions according to df1[df2] and df2[df1])
library(data.table)
library(dplyr)  # for distinct()

merge_schedule_obs <- function(df1, df2) {
  setDT(df1)
  setDT(df2)
  max_delay <- 3600  # 1-hour max delay for pairing schedule and observation
  # setkey() returns the keyed table invisibly, so := can be chained to add a helper column
  setkey(df1, stop_id, departure)[, departScheduled := departure]
  # positive roll (LOCF): each row of df2 is paired with the last df1 row at or before it
  df3 <- df1[df2, roll = max_delay]
  return(df3)
}

merge_schedule_obs2 <- function(df1, df2) {
  setDT(df1)
  setDT(df2)
  max_delay <- 3600  # 1-hour max delay for pairing schedule and observation
  setkey(df1, stop_id, departure)[, departObserved := departure]
  # negative roll (NOCB): each row of df2 is paired with the next df1 row at or after it
  df3 <- df1[df2, roll = -max_delay]
  return(df3)
}
#CREATING SOME SAMPLE DATA (The real dataset is significantly larger. But this example should cover the problem)
#Sample Schedule Data:
df1 <- distinct(data.table(stop_id = as.factor(c("70015","70009", "70003", "70019","70013", "70007", "70019", "70005", "70007", "70019", "70005")),
departure = as.POSIXct(c("2019-09-09 06:57:00","2019-09-09 06:57:00", "2019-09-09 06:57:00",
"2019-09-09 06:54:30","2019-09-09 06:54:00", "2019-09-09 06:55:00",
"2019-09-09 06:55:30","2019-09-09 06:55:00", "2019-09-09 06:55:10",
"2019-09-09 06:55:00", "2019-09-09 06:58:00"))))
Out:
stop_id departure
1: 70015 2019-09-09 06:57:00
2: 70009 2019-09-09 06:57:00
3: 70003 2019-09-09 06:57:00
4: 70019 2019-09-09 06:54:30
5: 70013 2019-09-09 06:54:00
6: 70007 2019-09-09 06:55:00
7: 70019 2019-09-09 06:55:30
8: 70005 2019-09-09 06:55:00
9: 70007 2019-09-09 06:55:10
10: 70019 2019-09-09 06:55:00
11: 70005 2019-09-09 06:58:00
#Sample Observations Data:
df2 <- distinct(data.table(stop_id = as.factor(c("70013","70009", "70015", "70005", "70005", "70007", "70019")),
departure = as.POSIXct(c("2019-09-09 06:57:10","2019-09-09 07:00:17", "2019-09-09 07:00:12", "2019-09-09 06:58:20", "2019-09-09 06:58:00",
"2019-09-09 06:57:30", "2019-09-09 06:57:35")))
)
Out:
stop_id departure
1: 70013 2019-09-09 06:57:10
2: 70009 2019-09-09 07:00:17
3: 70015 2019-09-09 07:00:12
4: 70005 2019-09-09 06:58:20
5: 70005 2019-09-09 06:58:00
6: 70007 2019-09-09 06:57:30
7: 70019 2019-09-09 06:57:35
#MERGING DATASETS: (Both directions are covered, and the problem shows in both)
merged_df <- distinct(na.omit(merge_schedule_obs(df1,df2)))
Out:
stop_id departure departScheduled
1: 70005 2019-09-09 06:58:00 2019-09-09 06:58:00
2: 70005 2019-09-09 06:58:20 2019-09-09 06:58:00
3: 70007 2019-09-09 06:57:30 2019-09-09 06:55:10
4: 70009 2019-09-09 07:00:17 2019-09-09 06:57:00
5: 70013 2019-09-09 06:57:10 2019-09-09 06:54:00
6: 70015 2019-09-09 07:00:12 2019-09-09 06:57:00
7: 70019 2019-09-09 06:57:35 2019-09-09 06:55:30
merged_df2 <- distinct(na.omit(merge_schedule_obs2(df2,df1)))
Out:
stop_id departure departObserved
1: 70005 2019-09-09 06:55:00 2019-09-09 06:58:00
2: 70005 2019-09-09 06:58:00 2019-09-09 06:58:00
3: 70007 2019-09-09 06:55:00 2019-09-09 06:57:30
4: 70007 2019-09-09 06:55:10 2019-09-09 06:57:30
5: 70009 2019-09-09 06:57:00 2019-09-09 07:00:17
6: 70013 2019-09-09 06:54:00 2019-09-09 06:57:10
7: 70015 2019-09-09 06:57:00 2019-09-09 07:00:12
8: 70019 2019-09-09 06:54:30 2019-09-09 06:57:35
9: 70019 2019-09-09 06:55:00 2019-09-09 06:57:35
10: 70019 2019-09-09 06:55:30 2019-09-09 06:57:35
Running the rolling join on the datasets df1 and df2 creates a duplicate entry of the schedule at station 70005 (if I run the join the other way around, df1[df2, roll = max_delay, ...], the duplicates appear in the observations instead, at stations 70005 and 70019). Basically, I can't get rid of them. I tried the mult options ("first", "last") and played with rollends, but I still always end up with unwanted duplicates.
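A minimal reproduction of the duplication (toy data): when several schedule rows roll onto the same single observation, each of them appears in the output with the same departObserved. As far as I can tell, mult only controls multiple matches within the keyed table for a single i row, which is why it doesn't remove these repetitions:

```r
library(data.table)

# Toy reproduction: two schedules, one observation at the same stop
sched <- data.table(stop_id = "S1",
                    departure = as.POSIXct(c("2019-09-09 06:55:00",
                                             "2019-09-09 06:58:00"), tz = "UTC"))
obs   <- data.table(stop_id = "S1",
                    departure = as.POSIXct("2019-09-09 07:00:00", tz = "UTC"))

setkey(obs, stop_id, departure)
obs[, departObserved := departure]

# Both schedule rows roll onto the single 07:00:00 observation, so that
# observation appears twice; mult = "first" does not change this, because
# mult acts per i row, not across i rows
obs[sched, roll = -Inf]
```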
How can I merge these datasets without any duplicates?
Thank you!