1
votes

I am trying to join two data frames; by ID, using the data.table package in R, as well as the lubridate package for dates. I only want to join rows with a Date2 that is 3 months prior or after the Date1

df1

ID Date1
1 2019-09-09
2 2019-09-09
3 2019-09-09
4 2019-09-09

Df2

Id Date 2 Value
1 2020-09-09 6
1 2019-10-09 7
2 2019-03-13 8
4 2019-10-27 15

Final data frame that I want

ID Date1 Id Date 2 Value
1 2019-09-09 1 2020-09-09 6
2 2019-09-09 2 NA NA
3 2019-09-09 NA NA NA
4 2019-09-09 4 2019-10-27 15

I have included my code below. I am getting error saying the = after roll is not needed, but I think it is needed.

library(data.table)
library(lubridate)

df1 <- data.table(ID = seq_len(4L),
                  Date1 = as.Date("2019-09-09"))

df2 <- data.table(Id = c(1L, 1L, 2L, 4L),
                  Date2 = as.Date(c("2020-09-09","2019-10-09","2019-03-13","2019-10-27")),
                  Value = c(6L, 7L, 8L, 15L))


df2[, Date1Copy := Date2]  
df2[, IDcopy := Id] 
setkey(df2, ID, Date2)    ## set the column to perform the join on
setkey(df, ID, Date1) 
test = df1[df2, rollends = if (roll=="nearest") c(TRUE,TRUE)
           else if (roll = as.Date(Date1)%m+% months(3)|roll = as.Date(Date1)%m-% months(3)) c(FALSE,TRUE)
           else c(TRUE,FALSE)]```
1
In second row of your result table, should Id=2? It seems like if you were only joining rows that fell within the time window, Id for that record would be NA - Matt Summersgill
why not a non-equi join? - chinsoon12
@chinsoon12 I thought about the non-equi join approach, but I couldn’t figure out a way to select the “nearest” record for eachID (any simpler than the method I proposed). Do you see a way to implement both the “nearest” selection and cross-table Date comparison in a single join statement? - Matt Summersgill
@MattSummersgill I thought OP wasn’t asking for nearest. But it’s prob possible with by=.EACHI but prob slower than chainlinking [] - chinsoon12

1 Answers

1
votes

Rolling joins in data.table are extremely useful but can be a little bit tough to get the hang of. The syntax for rollends is quite a bit different than what you've got there, it's not designed to handle any kind of complex logic, just simple a TRUE/FALSE case.

At any rate, here's one way to approach this problem. Using month arithmetic as a filtering criteria in combination with the nearest value requirement makes this a multi-step problem instead of a one-step join (at least any way I can see it).

While the join/filter/copy result values operation is technically a one-liner, I did my best to add in plenty of explanation of the nested operations.

## Make a copy of Date2 to use as key, as it will be inaccessible within the joined table
df2[, Date2Copy := Date2]

## Set Keys
setkey(df1,ID,Date1)
setkey(df2,Id,Date2Copy)

## Step 3: (read the inner nested steps first!)
## After performing the steps 1/2, join the intermediate result table back to `df1`...
df1[
  ## Step 1:
  ## First use the key of `df1` to subset `df2`` with a rolling join
  df2[df1,.(ID, Date1, Date2), roll = "nearest"
      ## Step 2:
      ## Then apply the +/- 3 month filtering critera
      ][between(Date2,
                Date1 %m-% months(3),
                Date1 %m+% months(3))]
  ## Step 3:
  ## ...on the `ID` column and add the intermediate results
  ## for `Date2` and `Value` columns to `df1` by reference
  , c("Date2","Value") := .(i.Date2,i.Value), on = .(ID)]
     
## Results    
print(df1)
#    ID      Date1      Date2 Value
# 1:  1 2019-09-09 2019-10-09     7
# 2:  2 2019-09-09       <NA>    NA
# 3:  3 2019-09-09       <NA>    NA
# 4:  4 2019-09-09 2019-10-27    15

These are my three go-to resources (other than the package documentation) for rolling joins, they've all helped me understand some of the quirks at multiple points over the years.