Rolling join in Data.Table by two variables without creating duplicates

Question

EDIT 30.09.2019: Added examples and modified code for reproducibility.

EDIT 01.10.2019: Better Explanation.

I'm merging two different datasets of a transportation system. One (df1) is the schedules at which trains should be at a certain station, the other (df2) is the observations of trains passing by those stations.

Since it's common that some schedules can't be fulfilled (a train broke, services need to be removed, construction sites, etc etc), the observations dataset contains fewer entries than the schedules one.

My goal is to measure the delays of the realized schedules, from a passenger perspective. This means: if a passenger knows at what time the train should be departing a station, the delay he sees corresponds to the train leaving at the exact scheduled time (delay = 0) or the immediate next train (delay = depart_observed - depart_scheduled).
The goal is translated into the following task: merge both datasets by:
-stop_id: Since I need to match observations for each station, this is obviously the first matching criteria.
-timestamp: for every schedule timestamp I should find the immediate consecutive observation timestamp.
The result should show all possible schedules (unique, no repetitions) matched with the closest following observation (unique too, no repetition). Again, I need to make sure that the way the merging is performed, only observations that happen AFTER or SIMULTANEOUSLY to the schedule are matched with the schedule's proper timestamp. Given the size difference between both datasets, I'm expecting plenty NAs in the observed column, since the schedules can't be paired with an observation.

For simplicity, the provided examples contain only the two columns I'm using for the join:

stop_id (id for the station) and timestamp (date & time at which the trains should be or were observed at the stations).

The approach I'm using is using a rolling join with Data.Table in R. This works great, except for the fact that whenever I'm creating the join, I keep getting duplicates of one dataset, and I can't have those repetitions in the merged set.

The code I'm using:

#DECLARING FUNCTIONS (Two functions according to df1[df2] and df2[df1])

merge_schedule_obs <- function(df1, df2) {
  setDT(df1)
  setDT(df2)
  max_delay <- 3600 # 1-hour max delay for pairing schedule and occurrence
  setkey(df1, stop_id, departure)[, departScheduled:=departure]
  df3 <- df1[df2, roll = max_delay]
  return(df3)
}

merge_schedule_obs2 <- function(df1, df2) {
  setDT(df1)
  setDT(df2)
  max_delay <- 3600 # 1-hour max delay for pairing schedule and occurrence
  setkey(df1, stop_id, departure)[, departObserved:=departure]
  df3 <- df1[df2, roll = -max_delay]
  return(df3)
}

#CREATING SOME SAMPLE DATA (The real dataset is significantly larger. But this example should cover the problem)


#Sample Schedule Data:

df1 <- distinct(data.table(stop_id = as.factor(c("70015","70009", "70003", "70019","70013", "70007", "70019", "70005", "70007", "70019", "70005")),
                           departure = as.POSIXct(c("2019-09-09 06:57:00","2019-09-09 06:57:00", "2019-09-09 06:57:00",
                                                    "2019-09-09 06:54:30","2019-09-09 06:54:00", "2019-09-09 06:55:00",
                                                    "2019-09-09 06:55:30","2019-09-09 06:55:00", "2019-09-09 06:55:10",
                                                    "2019-09-09  06:55:00", "2019-09-09  06:58:00"))))



Out:

    stop_id           departure
 1:   70015 2019-09-09 06:57:00
 2:   70009 2019-09-09 06:57:00
 3:   70003 2019-09-09 06:57:00
 4:   70019 2019-09-09 06:54:30
 5:   70013 2019-09-09 06:54:00
 6:   70007 2019-09-09 06:55:00
 7:   70019 2019-09-09 06:55:30
 8:   70005 2019-09-09 06:55:00
 9:   70007 2019-09-09 06:55:10
10:   70019 2019-09-09 06:55:00
11:   70005 2019-09-09 06:58:00

#Sample Observations Data:

df2 <- distinct(data.table(stop_id = as.factor(c("70013","70009", "70015", "70005", "70005", "70007", "70019")),
                           departure = as.POSIXct(c("2019-09-09 06:57:10","2019-09-09 07:00:17", "2019-09-09 07:00:12",  "2019-09-09  06:58:20", "2019-09-09  06:58:00",
                                                    "2019-09-09 06:57:30", "2019-09-09 06:57:35")))
)

Out:

   stop_id           departure
1:   70013 2019-09-09 06:57:10
2:   70009 2019-09-09 07:00:17
3:   70015 2019-09-09 07:00:12
4:   70005 2019-09-09 06:58:20
5:   70005 2019-09-09 06:58:00
6:   70007 2019-09-09 06:57:30
7:   70019 2019-09-09 06:57:35

#MERGING DATASETS: (Both directions are covered, and the problem shows in both)

merged_df <- distinct(na.omit(merge_schedule_obs(df1,df2))) 

Out:

   stop_id           departure     departScheduled
1:   70005 2019-09-09 06:58:00 2019-09-09 06:58:00
2:   70005 2019-09-09 06:58:20 2019-09-09 06:58:00
3:   70007 2019-09-09 06:57:30 2019-09-09 06:55:10
4:   70009 2019-09-09 07:00:17 2019-09-09 06:57:00
5:   70013 2019-09-09 06:57:10 2019-09-09 06:54:00
6:   70015 2019-09-09 07:00:12 2019-09-09 06:57:00
7:   70019 2019-09-09 06:57:35 2019-09-09 06:55:30

merged_df2 <- distinct(na.omit(merge_schedule_obs2(df2,df1))) 

Out:
    stop_id           departure      departObserved
 1:   70005 2019-09-09 06:55:00 2019-09-09 06:58:00
 2:   70005 2019-09-09 06:58:00 2019-09-09 06:58:00
 3:   70007 2019-09-09 06:55:00 2019-09-09 06:57:30
 4:   70007 2019-09-09 06:55:10 2019-09-09 06:57:30
 5:   70009 2019-09-09 06:57:00 2019-09-09 07:00:17
 6:   70013 2019-09-09 06:54:00 2019-09-09 06:57:10
 7:   70015 2019-09-09 06:57:00 2019-09-09 07:00:12
 8:   70019 2019-09-09 06:54:30 2019-09-09 06:57:35
 9:   70019 2019-09-09 06:55:00 2019-09-09 06:57:35
10:   70019 2019-09-09 06:55:30 2019-09-09 06:57:35

The result of running the rolling join with the datasets df1 and df2 create a duplicate entry of the schedule in station 70005 (if I run the join the other way around df1[df2, roll = max_delay, ...] the duplicates happen in the observation instead, in stations 70005 and 70019). Basically, I can't get rid of them. I tried the options with mult (first, last) and trying to play with rollends... still always unwanted duplicates...

How can I merge these datasets without any duplicates?

Thank you!

please have a look at the answer and let me know if it works for you , @Gonz87 :) — Vitali Avagyan

Alexis Alexis · Accepted Answer · 2019-10-01T21:55:22

First of all, you could use unique instead of distinct (the latter presumably from dplyr; you don't specify) to avoid coercing the data table to a data frame.

You were pretty close, but you need to switch the tables in the join, i.e. something like df2[df1], so that the rows from df1 are used as search keys, and then you can use mult to remove duplicates.

Here's one way to do what you want with a non-equi join:

setkey(df1, departure)
setkey(df2, departure)

df1[, max_departure := departure + as.difftime(1, units = "hours")
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
      ][, max_departure := NULL]

We order by departure (via setkey) so that mult = "first" returns the closest match in the future within what's allowed. The intermediate column max_departure has to be assigned and subsequently removed because non-equi joins can only use existing columns. Also note that the syntax used takes from this answer (the version with .SD instead of df1 doesn't work in this case, I don't know why).

EDIT: based on the comments, it occurs to me that when you say "duplicated", you might be referring to something different. Say you have planned departures at 10 and 10:30, but the one at 10 never takes place, and an observed departure is 10:31. Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30, and cannot be used for the one at 10? If that's the case, perhaps this will work:

setkey(df1, departure)
setkey(df2, departure)

max_dep <- function(departure) {
  max_departure <- departure + as.difftime(1, units = "hours")

  next_departure <- shift(departure,
                          fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
                          type = "lead")

  invalid_max <- max_departure >= next_departure

  max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
  max_departure
}

df1[, max_departure := max_dep(departure), by = "stop_id"
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
      ][, max_departure := NULL]

The max_dep helper checks, for each stop and scheduled departure, what would be the next scheduled departure, and sets max_departure as "next minus 1 second" if the next departure is within one hour.

The other solution wouldn't work for this because, as long as an observed departure falls within one hour of the scheduled one, it is a valid option. In my example that means 10:31 would be valid for both 10:30 and 10.

Rolling join in Data.Table by two variables without creating duplicates

1 Answers