3
votes

Stealing a dummy example from elsewhere on SO (Join data.table on exact date or if not the case on the nearest less than date), I'm looking to join two tables based on the first date (date in Dt1) being strictly earlier than the second date (date in Dt2).

Also turned off the 'warning' message from the 'slide' function for the DataCombine solution, as it was probably unfairly slowing down mtotos solution.

library(data.table)

Dt1 <- read.table(text="
date      x
1/26/2010,  10  
1/25/2010,  9  
1/24/2010,  9   
1/22/2010,  7    
1/19/2010,  11", header=TRUE, stringsAsFactors=FALSE)

Dt2 <- read.table(text="
date
1/26/2010   
1/23/2010   
1/20/2010", header=TRUE, stringsAsFactors=FALSE)

Desired result from join

   date     x  
1/26/2010 - 9 # based on closest observation strictly less than date  
1/23/2010 - 7   
1/20/2010 - 11

Timings of two solutions

(I keep the data.frame format for input to mtoto's solution, and data.table for jangorecki's).

solution.mtoto = function(Df1, Df2)
{
  #Full outer join of two df's
  merged <- merge(Df1, Df2, by = "date", all = T, sort=T)

  # Shifting values backwards by one using 'slide' from DataCombine
  merged <- slide(merged, Var = "x", slideBy = -1, reminder = F)

  # Inner join retaining the relevant cols
  return(merge(Df2,merged)[,-2])
}

solution.jangorecki = function(Dt1, Dt2)
{
  offset.roll.join = function(Dt1, Dt2){
    Dt2[, jndate := date - 1L] # produce join column with offset
    on.exit(Dt2[, jndate := NULL]) # cleanup join col on exit
    Dt1[Dt2, .(date = i.date, x), on = c("date" = "jndate"), roll = Inf] # do rolling join
  }
  return(offset.roll.join(Dt1, Dt2))
}

res.mtoto = sapply(1:10, FUN = function(x){system.time({solution.mtoto(Df1, Df2)})})

res.jangorecki = sapply(1:10, FUN = function(x){system.time({solution.jangorecki(Dt1, Dt2)})})


> res.mtoto[c("user.self", "sys.self"),]
           [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
user.self 0.004 0.004 0.004 0.004 0.003 0.003 0.003 0.003 0.003 0.003
sys.self  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

> res.jangorecki[c("user.self", "sys.self"),]
           [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
user.self 0.005 0.005 0.004 0.004 0.005 0.004 0.004 0.004 0.003 0.004
sys.self  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

Edit, accidentally referred to Dt1 instead of Df1 in mtoto's solution. Now fixed.

Similar speed (may be more noticeable on bigger datasets?). My additional problem is that I would like the date returned in the second table.

For example, the desired result would be:

date - x - date2
1/26/2010 - 9 - 1/25/2010
1/23/2010 - 7 - 1/22/2010
1/20/2010 - 11 - 1/19/2010
2
you can subtract integer 1 from date and continue with rolling join as in linked question, or add 1 into second table's date.jangorecki
Full outer join can be very costly, depends how big the gaps in bigger sets would be. Timing on just few rows won't show anything.jangorecki

2 Answers

4
votes

Rolling join with -1L offset.

Update 2016-04-02: With this commit in current devel, v1.9.7, this can be done without creating a temporary column. From NEWS:

x's columns can be referred to in j using the prefix x. at all times. This is particularly useful when it is necessary to x's column that is also a join column. This is a patch addressing #1615.

Dt2[, jndate := date - 1L]
Dt1[Dt2,
    .(date = i.date, orgdate = x.date, x),
    on = c("date" = "jndate"),
    roll = Inf]
#         date    orgdate  x
#1: 2010-01-26 2010-01-25  9
#2: 2010-01-23 2010-01-22  7
#3: 2010-01-20 2010-01-19 11

Original answer, useful if you are on 1.9.6 or older.

library(data.table)

# data
Dt1 = fread("date      x
1/26/2010,  10  
1/25/2010,  9  
1/24/2010,  9   
1/22/2010,  7    
1/19/2010,  11")[, date := as.IDate(date, format=("%m/%d/%Y"))][]
Dt2 = fread("date
1/26/2010   
1/23/2010   
1/20/2010")[, date := as.IDate(date, format=("%m/%d/%Y"))][]

# solution
offset.roll.join = function(Dt1, Dt2){
    Dt2[, jndate := date - 1L] # produce join column with offset
    Dt1[, orgdate := date] # should not be needed after data.table#1615
    on.exit({Dt2[, jndate := NULL]; Dt1[, orgdate := NULL]}) # cleanup on exit
    Dt1[Dt2, .(date = i.date, orgdate, x), on = c("date" = "jndate"), roll = Inf] # do rolling join
}
offset.roll.join(Dt1, Dt2)
#         date    orgdate  x
#1: 2010-01-26 2010-01-25  9
#2: 2010-01-23 2010-01-22  7
#3: 2010-01-20 2010-01-19 11
1
votes

In three steps:

library(DataCombine)

#Full outer join of two df's
merged <- merge(Dt1, Dt2, by = "date", all = T)

# Shifting values backwards by one using 'slide' from DataCombine
merged <- slide(merged, Var = "x", slideBy = -1)

# Inner join retaining the relevant cols
merge(Dt2,merged)[,-2]
#       date x-1
#1 1/20/2010  11
#2 1/23/2010   7
#3 1/26/2010   9