1 vote

I have two data frames of different lengths. Both contain dates. I would like to inner join on the closest date without going over. For example, if the date is "2019-01-05" and the two candidates are "2019-01-06" and "2019-01-02", it should join with "2019-01-02". I tried roll = Inf from the data.table package but I am not getting the desired result. Here are two toy data frames that I want to join by date.

I would also like to group the join by the ticker column. In tidyverse, the code would be similar to:

df1 %>% group_by(ticker) %>% inner_join(df2, by = "date")  # but choose the closest date without going over


df1 = data.frame(ticker = c("AAPL", "AAPL", "MSFT", "MSFT"), date = c("2019-01-06", "2019-02-06", "2019-01-06", "2019-05-02"))

df2 = data.frame(ticker = c("AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"), date = c("2019-01-03", "2019-01-07" , "2019-02-06", "2019-01-05", "2019-01-07", "2019-05-02"), randomVar = rnorm(6))

print(df1)
 ticker       date
1   AAPL 2019-01-06
2   AAPL 2019-02-06
3   MSFT 2019-01-06
4   MSFT 2019-05-02


print(df2)
 ticker       date  randomVar
1   AAPL 2019-01-03 -0.5321493
2   AAPL 2019-01-07 -0.7909461
3   AAPL 2019-02-06  0.2121993
4   MSFT 2019-01-05  1.2336315
5   MSFT 2019-01-07 -0.2729354
6   MSFT 2019-05-02 -0.5349596

I would like to create df3 which is a join between df1 and df2. The date key column should only be the ones from the df1 date column.

In this case, our new df3 should look exactly like this:

print(df3)
 ticker       date  randomVar
1   AAPL 2019-01-06 -0.5321493
2   AAPL 2019-02-06  0.2121993
3   MSFT 2019-01-06  1.2336315
4   MSFT 2019-05-02 -0.5349596
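One way to sanity-check that target output is a small base-R sketch (an illustration only, not one of the answers below): for each ticker, findInterval() returns the index of the last df2 date that does not go over the df1 date.

```r
# Base-R sketch of "closest date without going over", per ticker.
df1 <- data.frame(ticker = c("AAPL", "AAPL", "MSFT", "MSFT"),
                  date = as.Date(c("2019-01-06", "2019-02-06",
                                   "2019-01-06", "2019-05-02")))
df2 <- data.frame(ticker = c("AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"),
                  date = as.Date(c("2019-01-03", "2019-01-07", "2019-02-06",
                                   "2019-01-05", "2019-01-07", "2019-05-02")),
                  randomVar = c(-0.5321493, -0.7909461, 0.2121993,
                                1.2336315, -0.2729354, -0.5349596))

df1$randomVar <- unlist(lapply(seq_len(nrow(df1)), function(i) {
  sub <- df2[df2$ticker == df1$ticker[i], ]
  sub <- sub[order(sub$date), ]
  # findInterval gives the index of the last sub$date <= df1$date[i],
  # or 0 if every df2 date is later (no match without going over)
  j <- findInterval(df1$date[i], sub$date)
  if (j == 0) NA_real_ else sub$randomVar[j]
}))
print(df1)
```

This reproduces the df3 above, and rows with no eligible earlier date come back as NA rather than matching forward.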
See "rolling joins" in the data.table package. Tutorial here. - eipi10
fuzzyjoin could also help this one I think - link here - Sinh Nguyen

2 Answers

6 votes

There is already an accepted answer, but since the data.table tag is present, here is a data.table solution:

# convert sample data to data.table
setDT(df1)
setDT(df2)

# convert the date columns to real dates
df1[, date := as.IDate(date)]
df2[, date := as.IDate(date)]

# update df1 by reference with a rolling join: for each df1 row,
# roll = Inf takes the most recent df2 randomVar whose date is on or
# before the df1 date, within ticker
df1[, randomVar := df2[df1, x.randomVar, on = .(ticker, date), roll = Inf]]

#    ticker       date  randomVar
# 1:   AAPL 2019-01-06 -0.5321493
# 2:   AAPL 2019-02-06  0.2121993
# 3:   MSFT 2019-01-06  1.2336315
# 4:   MSFT 2019-05-02 -0.5349596
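If you would rather build a new table than update df1 by reference, the same rolling join can be written with df1 in i (a sketch using the same toy data): roll = Inf rolls the last df2 observation forward, so each df1 date gets the nearest prior-or-equal df2 row, and the date column in the result comes from df1, as the question requires.

```r
library(data.table)

df1 <- data.table(ticker = c("AAPL", "AAPL", "MSFT", "MSFT"),
                  date = as.IDate(c("2019-01-06", "2019-02-06",
                                    "2019-01-06", "2019-05-02")))
df2 <- data.table(ticker = c("AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"),
                  date = as.IDate(c("2019-01-03", "2019-01-07", "2019-02-06",
                                    "2019-01-05", "2019-01-07", "2019-05-02")),
                  randomVar = c(-0.5321493, -0.7909461, 0.2121993,
                                1.2336315, -0.2729354, -0.5349596))

# df2[df1] keeps every row of df1; roll = Inf fills randomVar from the
# most recent df2 row at or before each df1 date, within ticker
df3 <- df2[df1, on = .(ticker, date), roll = Inf]
print(df3)
```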
5 votes

This can be done in SQL with the default SQLite backend: left join df2 to df1 on ticker and on the df2 date being less than or equal to the df1 date, then group by df1 row and keep the joined row with the maximum df2 date. (This relies on SQLite's documented behavior that when max() appears in an aggregate query, bare columns such as randomVar are taken from the row where the maximum occurs.)

library(sqldf)
sqldf("select df1.*, max(df2.date), df2.randomVar from df1 
  left join df2 on df1.ticker = df2.ticker and df1.date >= df2.date
  group by df1.rowid
  order by df1.rowid")[-3]

giving:

  ticker       date  randomVar
1   AAPL 2019-01-06 -0.5321493
2   AAPL 2019-02-06  0.2121993
3   MSFT 2019-01-06  1.2336315
4   MSFT 2019-05-02 -0.5349596

Note

Inputs in reproducible form:

Lines1 <- "ticker       date
1   AAPL 2019-01-06
2   AAPL 2019-02-06
3   MSFT 2019-01-06
4   MSFT 2019-05-02"


Lines2 <- "ticker       date  randomVar
1   AAPL 2019-01-03 -0.5321493
2   AAPL 2019-01-07 -0.7909461
3   AAPL 2019-02-06  0.2121993
4   MSFT 2019-01-05  1.2336315
5   MSFT 2019-01-07 -0.2729354
6   MSFT 2019-05-02 -0.5349596"

df1 <- read.table(text = Lines1, as.is = TRUE)
df2 <- read.table(text = Lines2, as.is = TRUE)
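For completeness, newer dplyr versions support non-equi joins directly, so the tidyverse formulation the question sketches can now be written with join_by() and closest() (a sketch assuming dplyr >= 1.1.0 is installed):

```r
library(dplyr)

df1 <- data.frame(ticker = c("AAPL", "AAPL", "MSFT", "MSFT"),
                  date = as.Date(c("2019-01-06", "2019-02-06",
                                   "2019-01-06", "2019-05-02")))
df2 <- data.frame(ticker = c("AAPL", "AAPL", "AAPL", "MSFT", "MSFT", "MSFT"),
                  date = as.Date(c("2019-01-03", "2019-01-07", "2019-02-06",
                                   "2019-01-05", "2019-01-07", "2019-05-02")),
                  randomVar = c(-0.5321493, -0.7909461, 0.2121993,
                                1.2336315, -0.2729354, -0.5349596))

# closest(date >= date): match each df1 date to the nearest df2 date
# that is less than or equal to it, within ticker
df3 <- inner_join(df1, df2, by = join_by(ticker, closest(date >= date))) %>%
  select(ticker, date = date.x, randomVar)
print(df3)
```

The join keeps both date columns as date.x (from df1) and date.y (from df2); the select() keeps only df1's dates, matching the requested df3.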