Join two data frames using the data.table package in R based on dates that are within +/- 3 months

Question

I am trying to join two data frames; by ID, using the data.table package in R, as well as the lubridate package for dates. I only want to join rows with a Date2 that is 3 months prior or after the Date1

df1

ID	Date1
1	2019-09-09
2	2019-09-09
3	2019-09-09
4	2019-09-09

Df2

Id	Date 2	Value
1	2020-09-09	6
1	2019-10-09	7
2	2019-03-13	8
4	2019-10-27	15

Final data frame that I want

ID	Date1	Id	Date 2	Value
1	2019-09-09	1	2020-09-09	6
2	2019-09-09	2	NA	NA
3	2019-09-09	NA	NA	NA
4	2019-09-09	4	2019-10-27	15

I have included my code below. I am getting error saying the = after roll is not needed, but I think it is needed.

library(data.table)
library(lubridate)

df1 <- data.table(ID = seq_len(4L),
                  Date1 = as.Date("2019-09-09"))

df2 <- data.table(Id = c(1L, 1L, 2L, 4L),
                  Date2 = as.Date(c("2020-09-09","2019-10-09","2019-03-13","2019-10-27")),
                  Value = c(6L, 7L, 8L, 15L))


df2[, Date1Copy := Date2]  
df2[, IDcopy := Id] 
setkey(df2, ID, Date2)    ## set the column to perform the join on
setkey(df, ID, Date1) 
test = df1[df2, rollends = if (roll=="nearest") c(TRUE,TRUE)
           else if (roll = as.Date(Date1)%m+% months(3)|roll = as.Date(Date1)%m-% months(3)) c(FALSE,TRUE)
           else c(TRUE,FALSE)]```

In second row of your result table, should Id=2? It seems like if you were only joining rows that fell within the time window, Id for that record would be NA — Matt Summersgill
@chinsoon12 I thought about the non-equi join approach, but I couldn’t figure out a way to select the “nearest” record for eachID (any simpler than the method I proposed). Do you see a way to implement both the “nearest” selection and cross-table Date comparison in a single join statement? — Matt Summersgill
@MattSummersgill I thought OP wasn’t asking for nearest. But it’s prob possible with by=.EACHI but prob slower than chainlinking [] — chinsoon12

Matt Summersgill Matt Summersgill · Accepted Answer · 2021-05-17T22:25:49

Rolling joins in data.table are extremely useful but can be a little bit tough to get the hang of. The syntax for rollends is quite a bit different than what you've got there, it's not designed to handle any kind of complex logic, just simple a TRUE/FALSE case.

At any rate, here's one way to approach this problem. Using month arithmetic as a filtering criteria in combination with the nearest value requirement makes this a multi-step problem instead of a one-step join (at least any way I can see it).

While the join/filter/copy result values operation is technically a one-liner, I did my best to add in plenty of explanation of the nested operations.

## Make a copy of Date2 to use as key, as it will be inaccessible within the joined table
df2[, Date2Copy := Date2]

## Set Keys
setkey(df1,ID,Date1)
setkey(df2,Id,Date2Copy)

## Step 3: (read the inner nested steps first!)
## After performing the steps 1/2, join the intermediate result table back to `df1`...
df1[
  ## Step 1:
  ## First use the key of `df1` to subset `df2`` with a rolling join
  df2[df1,.(ID, Date1, Date2), roll = "nearest"
      ## Step 2:
      ## Then apply the +/- 3 month filtering critera
      ][between(Date2,
                Date1 %m-% months(3),
                Date1 %m+% months(3))]
  ## Step 3:
  ## ...on the `ID` column and add the intermediate results
  ## for `Date2` and `Value` columns to `df1` by reference
  , c("Date2","Value") := .(i.Date2,i.Value), on = .(ID)]
     
## Results    
print(df1)
#    ID      Date1      Date2 Value
# 1:  1 2019-09-09 2019-10-09     7
# 2:  2 2019-09-09       <NA>    NA
# 3:  3 2019-09-09       <NA>    NA
# 4:  4 2019-09-09 2019-10-27    15

These are my three go-to resources (other than the package documentation) for rolling joins, they've all helped me understand some of the quirks at multiple points over the years.

Join two data frames using the data.table package in R based on dates that are within +/- 3 months

1 Answers