Vector of Date between multiples dates

Question

frames df1 and df2. I want to add to df1 the version column that comes from the df2 dataframe. The condition is that the Date column of df1 is between df2.Date1 and df2.Date, and accordingly bring the corresponding version.

For example, the first date of df1 is 2020-03-25, it is between 2020-01-01 and 2020-03-25, therefore the version that corresponds to it is CZ01. Thanks for any help.

My df1

library(dplyr)
library(data.table)

set.seed(1);vec1 <- sample(seq(as.Date('2020-01-01'), as.Date('2020-12-31'), by="day"), 10)
set.seed(2);vec2 <- sample(seq(as.Date('2020-01-01'), as.Date('2020-12-31'), by="day"), 5)

df1 <- data.frame(ID = c(rep(1, 10),rep(2, 5)), Date = c(vec1, vec2)) %>%
       arrange(ID, Date)
   ID       Date
1   1 2020-03-25
2   1 2020-05-08
3   1 2020-06-15
4   1 2020-07-05
5   1 2020-09-26
6   1 2020-10-03
7   1 2020-10-25
8   1 2020-11-02
9   1 2020-11-19
10  1 2020-11-25
11  2 2020-07-16
12  2 2020-09-18
13  2 2020-09-29
14  2 2020-12-06
15  2 2020-12-14

My df2

df2 <- data.frame(ID = c(1,1,1,2,2), Version = c('CZ01','CZ02','CZ03','BX01','BX02'),
                  Date1 = as.Date(c('2020-01-01','2020-03-25','2020-07-02','2020-07-16','2020-10-31')),
                  Date2 = as.Date(c('2020-03-25','2020-07-02','2020-12-31','2020-10-31','2020-12-31')))
df2
  ID Version      Date1      Date2
1  1    CZ01 2020-01-01 2020-03-25
2  1    CZ02 2020-03-25 2020-07-02
3  1    CZ03 2020-07-02 2020-12-31
4  2    BX01 2020-07-16 2020-10-31
5  2    BX02 2020-10-31 2020-12-31

Expected output

   ID       Date Version
1   1 2020-03-25    CZ01
2   1 2020-05-08    CZ02
3   1 2020-06-15    CZ02
4   1 2020-07-05    CZ03
5   1 2020-09-26    CZ03
6   1 2020-10-03    CZ03
7   1 2020-10-25    CZ03
8   1 2020-11-02    CZ03
9   1 2020-11-19    CZ03
10  1 2020-11-25    CZ03
11  2 2020-07-16    BX01
12  2 2020-09-18    BX01
13  2 2020-09-29    BX01
14  2 2020-12-06    BX02
15  2 2020-12-14    BX02

r2evans r2evans · Accepted Answer · 2021-07-15T18:44:17

First, we can try a direct non-equi join (with both non-strict inequalities). We'll see that row 1 is repeated:

df2[df1, on = .(ID, Date1 <= Date, Date2 >= Date)
  ][, c("Date", "Date1", "Date2") := .(Date1, NULL, NULL)]
#        ID Version       Date
#     <num>  <char>     <Date>
#  1:     1    CZ01 2020-03-25
#  2:     1    CZ02 2020-03-25
#  3:     1    CZ02 2020-05-08
#  4:     1    CZ02 2020-06-15
#  5:     1    CZ03 2020-07-05
#  6:     1    CZ03 2020-09-26
#  7:     1    CZ03 2020-10-03
#  8:     1    CZ03 2020-10-25
#  9:     1    CZ03 2020-11-02
# 10:     1    CZ03 2020-11-19
# 11:     1    CZ03 2020-11-25
# 12:     2    BX01 2020-07-16
# 13:     2    BX01 2020-09-18
# 14:     2    BX01 2020-09-29
# 15:     2    BX02 2020-12-06
# 16:     2    BX02 2020-12-14

This can perhaps be remedied with aggregation (i.e., take the first non-NA Version), assuming we have a way to identify the original rows uniquely. I'll assume that ID and Date are unique, we can do

df2[df1, on = .(ID, Date1 <= Date, Date2 >= Date)
  ][, c("Date", "Date1", "Date2") := .(Date1, NULL, NULL)
  ][, .SD[which.min(is.na(Version)),], by = .(ID, Date)]
#        ID       Date Version
#     <num>     <Date>  <char>
#  1:     1 2020-03-25    CZ01
#  2:     1 2020-05-08    CZ02
#  3:     1 2020-06-15    CZ02
#  4:     1 2020-07-05    CZ03
#  5:     1 2020-09-26    CZ03
#  6:     1 2020-10-03    CZ03
#  7:     1 2020-10-25    CZ03
#  8:     1 2020-11-02    CZ03
#  9:     1 2020-11-19    CZ03
# 10:     1 2020-11-25    CZ03
# 11:     2 2020-07-16    BX01
# 12:     2 2020-09-18    BX01
# 13:     2 2020-09-29    BX01
# 14:     2 2020-12-06    BX02
# 15:     2 2020-12-14    BX02

An alternative approach is the assignment-join using non-equi join columns (which happens to reverse the order of data.table operands), but we run into another concern: since df1$Date[1] is the same as df2$Date2[1] and df2$Date1[2], that suggests that the non-equi join should be strict < on one side and not-strict >= on the other. If you do it so that "CZ01" is found for the first line, then df1$Date[11] will not fit within a range defined in df2.

library(data.table)
setDT(df1)
setDT(df2)

df1[df2, Version := Version, on = .(ID, Date > Date1, Date <= Date2)]
#        ID       Date Version
#     <num>     <Date>  <char>
#  1:     1 2020-03-25    CZ01
#  2:     1 2020-05-08    CZ02
#  3:     1 2020-06-15    CZ02
#  4:     1 2020-07-05    CZ03
#  5:     1 2020-09-26    CZ03
#  6:     1 2020-10-03    CZ03
#  7:     1 2020-10-25    CZ03
#  8:     1 2020-11-02    CZ03
#  9:     1 2020-11-19    CZ03
# 10:     1 2020-11-25    CZ03
# 11:     2 2020-07-16    <NA>
# 12:     2 2020-09-18    BX01
# 13:     2 2020-09-29    BX01
# 14:     2 2020-12-06    BX02
# 15:     2 2020-12-14    BX02

We can fill in the NA with

tmp <- df1[is.na(Version),
  ][, Version := NULL
  ][df2, Version := Version, on = .(ID, Date >= Date1, Date <= Date2)]
tmp
#       ID       Date Version
#    <num>     <Date>  <char>
# 1:     2 2020-07-16    BX01

rbindlist(list(df1[ !is.na(Version), ], tmp))
#        ID       Date Version
#     <num>     <Date>  <char>
#  1:     1 2020-03-25    CZ01
#  2:     1 2020-05-08    CZ02
#  3:     1 2020-06-15    CZ02
#  4:     1 2020-07-05    CZ03
#  5:     1 2020-09-26    CZ03
#  6:     1 2020-10-03    CZ03
#  7:     1 2020-10-25    CZ03
#  8:     1 2020-11-02    CZ03
#  9:     1 2020-11-19    CZ03
# 10:     1 2020-11-25    CZ03
# 11:     2 2020-09-18    BX01
# 12:     2 2020-09-29    BX01
# 13:     2 2020-12-06    BX02
# 14:     2 2020-12-14    BX02
# 15:     2 2020-07-16    BX01

Vector of Date between multiples dates

My df1

My df2

Expected output

2 Answers