Stealing a dummy example from elsewhere on SO (Join data.table on exact date or if not the case on the nearest less than date), I'm looking to join two tables based on the first date (date in Dt1) being strictly earlier than the second date (date in Dt2).
Also turned off the 'warning' message from the 'slide' function for the DataCombine solution, as it was probably unfairly slowing down mtotos solution.
library(data.table)
Dt1 <- read.table(text="
date x
1/26/2010, 10
1/25/2010, 9
1/24/2010, 9
1/22/2010, 7
1/19/2010, 11", header=TRUE, stringsAsFactors=FALSE)
Dt2 <- read.table(text="
date
1/26/2010
1/23/2010
1/20/2010", header=TRUE, stringsAsFactors=FALSE)
Desired result from join
date x
1/26/2010 - 9 # based on closest observation strictly less than date
1/23/2010 - 7
1/20/2010 - 11
Timings of two solutions
(I keep the data.frame format for input to mtoto's solution, and data.table for jangorecki's).
solution.mtoto = function(Df1, Df2)
{
#Full outer join of two df's
merged <- merge(Df1, Df2, by = "date", all = T, sort=T)
# Shifting values backwards by one using 'slide' from DataCombine
merged <- slide(merged, Var = "x", slideBy = -1, reminder = F)
# Inner join retaining the relevant cols
return(merge(Df2,merged)[,-2])
}
solution.jangorecki = function(Dt1, Dt2)
{
offset.roll.join = function(Dt1, Dt2){
Dt2[, jndate := date - 1L] # produce join column with offset
on.exit(Dt2[, jndate := NULL]) # cleanup join col on exit
Dt1[Dt2, .(date = i.date, x), on = c("date" = "jndate"), roll = Inf] # do rolling join
}
return(offset.roll.join(Dt1, Dt2))
}
res.mtoto = sapply(1:10, FUN = function(x){system.time({solution.mtoto(Df1, Df2)})})
res.jangorecki = sapply(1:10, FUN = function(x){system.time({solution.jangorecki(Dt1, Dt2)})})
> res.mtoto[c("user.self", "sys.self"),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
user.self 0.004 0.004 0.004 0.004 0.003 0.003 0.003 0.003 0.003 0.003
sys.self 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
> res.jangorecki[c("user.self", "sys.self"),]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
user.self 0.005 0.005 0.004 0.004 0.005 0.004 0.004 0.004 0.003 0.004
sys.self 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Edit, accidentally referred to Dt1 instead of Df1 in mtoto's solution. Now fixed.
Similar speed (may be more noticeable on bigger datasets?). My additional problem is that I would like the date returned in the second table.
For example, the desired result would be:
date - x - date2
1/26/2010 - 9 - 1/25/2010
1/23/2010 - 7 - 1/22/2010
1/20/2010 - 11 - 1/19/2010
1
fromdate
and continue with rolling join as in linked question, or add1
into second table'sdate
. – jangorecki