4 votes

I have a data set indexed by two ID variables (one nested in the other) and date, and I wish to calculate a rolling statistic in this data.

My real dataset is large (~200 million rows), and I have enjoyed the speed gains from using data.table on other tasks, but I cannot figure out how to use data.table optimally (i.e. leverage binary search and avoid a vector scan) for this problem.

Sample Data:

library(data.table)

set.seed(3)
# note: the shorter columns are recycled to 248 rows (2 id1 x 2 id2 x 62 dates)
dt1 <-
 data.table(id1=c(rep("a",124),rep("b",124)),
            id2=c(rep("x",62),rep("y",62)),
            date=seq(as.Date("2012-05-01"),as.Date("2012-07-01"),"days"),
            var1=rpois(124,14),
            var2=rpois(124,3))
setkey(dt1,id1,id2,date)
# create uneven time spacing
dt1 <- dt1[-c(5,10,36,46,58)]

My end goal is to calculate a "rolling statistic" for each day within id1/id2, which is:

sum(var1)/sum(var2)

computed over all other rows with the same id1/id2 combination that fall in the 30 days prior to that row.

I am not sure this is a good first step, but for clarity's sake, here is the code that produces my desired result for all IDs on date 2012-06-12:

dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12")-31,
    list("newstat"=sum(var1)/sum(var2),
         "date"=as.Date("2012-06-12")),by=list(id1,id2)]

   id1 id2 newstat       date
1:   a   x   3.925 2012-06-12
2:   a   y   4.396 2012-06-12
3:   b   x   3.925 2012-06-12
4:   b   y   4.396 2012-06-12
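
For reference, one way to spell out the full end goal is simply to repeat that calculation for every distinct date. This brute-force sketch (the object names all.dates and brute are just illustrative) is hopelessly slow at ~200 million rows, but it pins down the output I'm after:

# Brute-force reference: redo the 30-day window calculation for each date.
# Far too slow for the real data; only useful as a correctness check.
all.dates <- dt1[, sort(unique(date))]
all.dates <- all.dates[-1]   # the earliest date has no prior rows in its window
brute <- rbindlist(lapply(seq_along(all.dates), function(i) {
  d <- all.dates[i]          # index into the vector to keep the Date class intact
  dt1[date < d & date > d - 31,
      list("newstat"=sum(var1)/sum(var2), "date"=d),
      by=list(id1,id2)]
}))
setkey(brute, id1, id2, date)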

I thought of doing a Cartesian self join on id1 and id2 and then reducing the resulting data.table down to the appropriate date range (I don't know how to do that in one step). This gives me the desired result; however, I don't see how to do the reduction without passing a logical subset as the i argument, which is unacceptably slow. I think I am failing to leverage the data.table's keys at that point, but I'm not sure how to solve this.

Example:

dt1[setkey(dt1[,list(id1,id2,"date_grp"=date)],id1,id2),
    list(id1,id2,date_grp,date,var1,var2),
    allow.cartesian=TRUE][   # current data.table versions require this for the many-to-many join
      # Here comes the slow vector-scan subset
      date < date_grp & date > date_grp-31,
      list("newstat"=sum(var1)/sum(var2)),
      by=list(id1,id2,date_grp)]

Result:

     id1 id2   date_grp newstat
  1:   a   x 2012-05-02  0.4286
  2:   a   x 2012-05-03  0.4000
  3:   a   x 2012-05-04  0.2857
  4:   a   x 2012-05-06  0.2903
  5:   a   x 2012-05-07  0.3056
 ---                           
235:   b   y 2012-06-27  0.2469
236:   b   y 2012-06-28  0.2354
237:   b   y 2012-06-29  0.2323
238:   b   y 2012-06-30  0.2426
239:   b   y 2012-07-01  0.2304

1 Answer

6 votes

I'm sure there is a better way to do this, but one thing you can do is avoid the full Cartesian join, which is what's killing you, by generating an interim table with the join keys:

dt.dates <- dt1[, list(date.join=seq(as.Date(date - 1, origin="1970-01-01"),
                                     by="-1 day", len=30)),
                by=list(date, id1, id2)]

For each date-id group, we've now generated the list of allowable join dates. Now we join back to the data and compute our metric.

setkey(dt.dates, date.join, id1, id2)
setkey(dt1,date,id1,id2)
dt.dates[dt1][ , sum(var1)/sum(var2), by=list(id1, id2, date)]

I couldn't replicate your result for 6/12, but I think we have a seeding issue. Compare:

> dt.dates[dt1][ , sum(var1)/sum(var2), by=list(id1, id2, date)][date=="2012-06-12"]
   id1 id2       date       V1
1:   a   x 2012-06-12 3.630631
2:   a   y 2012-06-12 4.434783
3:   b   x 2012-06-12 3.634783
4:   b   y 2012-06-12 4.434783
> dt1[date < as.Date("2012-06-12") & date > as.Date("2012-06-12")-31,
+     list("newstat"=sum(var1)/sum(var2), "date"=as.Date("2012-06-12")),
+     by=list(id1,id2)]
   id1 id2  newstat       date
1:   a   x 3.630631 2012-06-12
2:   a   y 4.434783 2012-06-12
3:   b   x 3.634783 2012-06-12
4:   b   y 4.434783 2012-06-12

Basically the same result.
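
And beyond that one date, you can check every date at once against the brute-force reference sketched in the question (the brute table there; the column name join.stat below is just illustrative):

res <- dt.dates[dt1][, list(join.stat=sum(var1)/sum(var2)),
                     by=list(id1, id2, date)]
# inner merge keeps only the (id1, id2, date) groups present in both results
chk <- merge(res, brute, by=c("id1", "id2", "date"))
chk[abs(join.stat - newstat) > 1e-12]   # expect an empty table: the two methods agree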