1
votes

frames df1 and df2. I want to add to df1 the version column that comes from the df2 dataframe. The condition is that the Date column of df1 is between df2.Date1 and df2.Date, and accordingly bring the corresponding version.

For example, the first date of df1 is 2020-03-25, it is between 2020-01-01 and 2020-03-25, therefore the version that corresponds to it is CZ01. Thanks for any help.

My df1

library(dplyr)
library(data.table)

set.seed(1);vec1 <- sample(seq(as.Date('2020-01-01'), as.Date('2020-12-31'), by="day"), 10)
set.seed(2);vec2 <- sample(seq(as.Date('2020-01-01'), as.Date('2020-12-31'), by="day"), 5)

df1 <- data.frame(ID = c(rep(1, 10),rep(2, 5)), Date = c(vec1, vec2)) %>%
       arrange(ID, Date)
   ID       Date
1   1 2020-03-25
2   1 2020-05-08
3   1 2020-06-15
4   1 2020-07-05
5   1 2020-09-26
6   1 2020-10-03
7   1 2020-10-25
8   1 2020-11-02
9   1 2020-11-19
10  1 2020-11-25
11  2 2020-07-16
12  2 2020-09-18
13  2 2020-09-29
14  2 2020-12-06
15  2 2020-12-14

My df2

df2 <- data.frame(ID = c(1,1,1,2,2), Version = c('CZ01','CZ02','CZ03','BX01','BX02'),
                  Date1 = as.Date(c('2020-01-01','2020-03-25','2020-07-02','2020-07-16','2020-10-31')),
                  Date2 = as.Date(c('2020-03-25','2020-07-02','2020-12-31','2020-10-31','2020-12-31')))
df2
  ID Version      Date1      Date2
1  1    CZ01 2020-01-01 2020-03-25
2  1    CZ02 2020-03-25 2020-07-02
3  1    CZ03 2020-07-02 2020-12-31
4  2    BX01 2020-07-16 2020-10-31
5  2    BX02 2020-10-31 2020-12-31

Expected output

   ID       Date Version
1   1 2020-03-25    CZ01
2   1 2020-05-08    CZ02
3   1 2020-06-15    CZ02
4   1 2020-07-05    CZ03
5   1 2020-09-26    CZ03
6   1 2020-10-03    CZ03
7   1 2020-10-25    CZ03
8   1 2020-11-02    CZ03
9   1 2020-11-19    CZ03
10  1 2020-11-25    CZ03
11  2 2020-07-16    BX01
12  2 2020-09-18    BX01
13  2 2020-09-29    BX01
14  2 2020-12-06    BX02
15  2 2020-12-14    BX02
2

2 Answers

1
votes

First, we can try a direct non-equi join (with both non-strict inequalities). We'll see that row 1 is repeated:

df2[df1, on = .(ID, Date1 <= Date, Date2 >= Date)
  ][, c("Date", "Date1", "Date2") := .(Date1, NULL, NULL)]
#        ID Version       Date
#     <num>  <char>     <Date>
#  1:     1    CZ01 2020-03-25
#  2:     1    CZ02 2020-03-25
#  3:     1    CZ02 2020-05-08
#  4:     1    CZ02 2020-06-15
#  5:     1    CZ03 2020-07-05
#  6:     1    CZ03 2020-09-26
#  7:     1    CZ03 2020-10-03
#  8:     1    CZ03 2020-10-25
#  9:     1    CZ03 2020-11-02
# 10:     1    CZ03 2020-11-19
# 11:     1    CZ03 2020-11-25
# 12:     2    BX01 2020-07-16
# 13:     2    BX01 2020-09-18
# 14:     2    BX01 2020-09-29
# 15:     2    BX02 2020-12-06
# 16:     2    BX02 2020-12-14

This can perhaps be remedied with aggregation (i.e., take the first non-NA Version), assuming we have a way to identify the original rows uniquely. I'll assume that ID and Date are unique, we can do

df2[df1, on = .(ID, Date1 <= Date, Date2 >= Date)
  ][, c("Date", "Date1", "Date2") := .(Date1, NULL, NULL)
  ][, .SD[which.min(is.na(Version)),], by = .(ID, Date)]
#        ID       Date Version
#     <num>     <Date>  <char>
#  1:     1 2020-03-25    CZ01
#  2:     1 2020-05-08    CZ02
#  3:     1 2020-06-15    CZ02
#  4:     1 2020-07-05    CZ03
#  5:     1 2020-09-26    CZ03
#  6:     1 2020-10-03    CZ03
#  7:     1 2020-10-25    CZ03
#  8:     1 2020-11-02    CZ03
#  9:     1 2020-11-19    CZ03
# 10:     1 2020-11-25    CZ03
# 11:     2 2020-07-16    BX01
# 12:     2 2020-09-18    BX01
# 13:     2 2020-09-29    BX01
# 14:     2 2020-12-06    BX02
# 15:     2 2020-12-14    BX02

An alternative approach is the assignment-join using non-equi join columns (which happens to reverse the order of data.table operands), but we run into another concern: since df1$Date[1] is the same as df2$Date2[1] and df2$Date1[2], that suggests that the non-equi join should be strict < on one side and not-strict >= on the other. If you do it so that "CZ01" is found for the first line, then df1$Date[11] will not fit within a range defined in df2.

library(data.table)
setDT(df1)
setDT(df2)

df1[df2, Version := Version, on = .(ID, Date > Date1, Date <= Date2)]
#        ID       Date Version
#     <num>     <Date>  <char>
#  1:     1 2020-03-25    CZ01
#  2:     1 2020-05-08    CZ02
#  3:     1 2020-06-15    CZ02
#  4:     1 2020-07-05    CZ03
#  5:     1 2020-09-26    CZ03
#  6:     1 2020-10-03    CZ03
#  7:     1 2020-10-25    CZ03
#  8:     1 2020-11-02    CZ03
#  9:     1 2020-11-19    CZ03
# 10:     1 2020-11-25    CZ03
# 11:     2 2020-07-16    <NA>
# 12:     2 2020-09-18    BX01
# 13:     2 2020-09-29    BX01
# 14:     2 2020-12-06    BX02
# 15:     2 2020-12-14    BX02

We can fill in the NA with

tmp <- df1[is.na(Version),
  ][, Version := NULL
  ][df2, Version := Version, on = .(ID, Date >= Date1, Date <= Date2)]
tmp
#       ID       Date Version
#    <num>     <Date>  <char>
# 1:     2 2020-07-16    BX01

rbindlist(list(df1[ !is.na(Version), ], tmp))
#        ID       Date Version
#     <num>     <Date>  <char>
#  1:     1 2020-03-25    CZ01
#  2:     1 2020-05-08    CZ02
#  3:     1 2020-06-15    CZ02
#  4:     1 2020-07-05    CZ03
#  5:     1 2020-09-26    CZ03
#  6:     1 2020-10-03    CZ03
#  7:     1 2020-10-25    CZ03
#  8:     1 2020-11-02    CZ03
#  9:     1 2020-11-19    CZ03
# 10:     1 2020-11-25    CZ03
# 11:     2 2020-09-18    BX01
# 12:     2 2020-09-29    BX01
# 13:     2 2020-12-06    BX02
# 14:     2 2020-12-14    BX02
# 15:     2 2020-07-16    BX01
1
votes

You can use the following solution. Here is some notes on how it works:

  • I first used map2 function from purrr as I needed to iterate over ID and Date variables in each row of df1
  • In the next step I chose all variables of df2 in each row which are in fact ID, Date1 & Date2 respectively which is a numeric vector of length 3 with ..1 variable in pmap referring to the first and so on
  • You may have noticed that I again converted your dates with as.Date in pmap, that's because when I gathered all columns except for Version they are all coerced to numeric
  • In the end I check whether our Date in each row of df1 falls into any category and also took into account the matching of IDs
  • Both will result in a logical vector of length 5 and I chose the first one that returns TRUE for either of them
  • One more thing to take into account is, since some of your Date values may fall into more than one categories like the first row I used [1] to only choose the first one
library(dplyr)
library(purrr)

df1 %>%
  mutate(Version = map2(ID, Date, function(a, b) {
    df2$Version[pmap_lgl(df2 %>% select(!Version), ~ between(b, as.Date(..2), as.Date(..3)) &
                       a == ..1)][1]
  }))

   ID       Date Version
1   1 2020-03-25    CZ01
2   1 2020-05-08    CZ02
3   1 2020-06-15    CZ02
4   1 2020-07-05    CZ03
5   1 2020-09-26    CZ03
6   1 2020-10-03    CZ03
7   1 2020-10-25    CZ03
8   1 2020-11-02    CZ03
9   1 2020-11-19    CZ03
10  1 2020-11-25    CZ03
11  2 2020-07-16    BX01
12  2 2020-09-18    BX01
13  2 2020-09-29    BX01
14  2 2020-12-06    BX02
15  2 2020-12-14    BX02