1
votes

I have two data.tables:

library(data.table) # 1.9.6

DT1 <- data.table(id = rep(c("a", "b"), c(3, 2)),
                  x  = c(6,12,15,37,41),
                  y  = 1)
DT1[]

DT2 <- data.table(id = rep(c("a", "b"), c(4, 5)),
                  x = c(2,7,11,15,27,30, 35, 38,40))
DT2[]

I would like to do a rolling join with the tables and update DT2 by reference with the value of y from DT1. This almost does what I want:

DT2[DT1, y := y, on = c("id", "x"), roll = -Inf]
DT2[]

However, I noticed that after this operation

DT2[4]

has y = 1, whereas, if I do the rolling join either a) without row 2 of DT1 OR without row 3 of DT1 I always get the same result since in both cases, the y from DT1 is joined to row 4 in DT2 - example:

DT2[, y:= NULL]
DT2[DT1[!2], y := y, on = c("id", "x"), roll = -Inf]
DT2[]
DT2[, y:= NULL]
DT2[DT1[!3], y := y, on = c("id", "x"), roll = -Inf]
DT2[]

What I would like to achieve as a result is

DT2[4, y := 2][]

i.e. where the join leads to multiple values in a "cell", I want the total / aggregate value. Is this possible using data.table's rolling join and update-by-reference functionality?

1
The newest stable version is 1.10.0 JFYDavid Arenburg
@DavidArenburg updating packages is not trivial in production environmemts especilly with potentially breaking changes.talat
@docendodiscimus you are probably aware, but still worth to mention... you can create new dir and use it as library path, install latest version there, and load it explicitly from that library only when you need it.jangorecki

1 Answers

2
votes

An ugly workaround is to build the aggregate table and then merge a second time on that:

agg = DT2[DT1, on=.(id, x), roll=-Inf, .(id, x = x.x, y = i.y)][, sum(y), by=.(id, x)]

DT2[agg, on=.(id, x), roll=-Inf, y := i.V1 ]

There's an open FR that I think would allow DT2[DT1, on=.(id,x), y := sum(i.y), by=x.x].