complex data.table subset and vectorised maniulation

Question

Ok I have a complex function built using data.frames and in trying to speed it up I've turned to data.table. I'm totally new to this so I'm quite befuddled. Anyhow I've made a much much simpler toy example of what I want to do, but I cannot work out how to translate it into data.table format. Here is the example in data.frame form:

    rows <- 10
    data1 <- data.frame(   id =1:rows,
                    a = seq(0.2, 0.55, length.out = rows),
                  b = seq(0.35, 0.7, length.out = rows),
                  c = seq(0.4, 0.83, length.out = rows),
                  d = seq(0.6, 0.87, length.out = rows),
                  e = seq(0.7, 0.99, length.out = rows),
                  f = seq(0.52, 0.90, length.out = rows)             
    )
    DT1 <- data.table(data1) #for later

    data2 <- data.frame(   id =3:1,
                   a = rep(3, 3),
                   d = rep(2, 3),
                   f = rep(1, 3)
    )
    m.names <- c("a", "d", "f")

    data1[match(data2$id, data1$id),m.names] <- data1[match(data2$id, data1$id),m.names] + data2[match(data2$id, data1$id),m.names]

So note in the last step that I want perform addition between the pre-existing figures and the new data and its vectorised across several columns.

In a data.table format I've only gotten this far:

    DT1[id %in% data2$id, m.names, with=FALSE]

This selects the values I want to add to but after that I am lost. I would appreciate any help !

EDIT:

Ok I've figure out part of it - I can use the last line of code above to achieve the vectorised addition part using using data2 to store the added values as follows:

    data2[,m.names] <- data2[,m.names] + data.frame(DT1[id %in% data2$id, m.names, with=FALSE])

Even with 2.5million rows (in DT1) and 10,000 rows in data2 and 6 matching columns this only takes 0.004sec, but I still need to assign the new data2 to the appropriate dynamically assigned columns in data 1

You can do a simple binary join if you don't mind calling the column names manually, something like setkey(setDT(data1), id) ; data1[data2, ":="(a = a + i.a,d = d + i.d,f = f + i.f)] — David Arenburg
@akrun your solution is more general so I'd leave it there and let the OP decide — David Arenburg
Thanks David , but doesn't work in my real data. I can't use a,d,f I need to use m.names[1] as the number of columns can vary - using m.names[] construct doesn't seem to work.. Second point is in the toy data the match version appears quicker (and in the real code I use fmatch) — user2498193
Ok I've found a quick way to do the addition via vectorisation - see my edit - but I still need to pack the results back into data1 using the variable assigned column names ?? — user2498193

Arun Arun · Accepted Answer · 2015-03-29T19:33:34

Here's another way, using devel version data.table v1.9.5:

require(data.table) ## v1.9.5+
setDT(data1)        ## data1 is now a data.table

cols1 = c("a", "d", "f")
cols2 = paste0("i.", cols1)
setkey(data1, id)   ## setkey and prepare for join
data1[data2, (cols1) := mapply(`+`, mget(cols1), mget(cols2), SIMPLIFY=FALSE)]
#     id         a         b         c    d         e         f
#  1:  1 3.2000000 0.3500000 0.4000000 2.60 0.7000000 1.5200000
#  2:  2 3.2388889 0.3888889 0.4477778 2.63 0.7322222 1.5622222
#  3:  3 3.2777778 0.4277778 0.4955556 2.66 0.7644444 1.6044444
#  4:  4 0.3166667 0.4666667 0.5433333 0.69 0.7966667 0.6466667
#  5:  5 0.3555556 0.5055556 0.5911111 0.72 0.8288889 0.6888889
#  6:  6 0.3944444 0.5444444 0.6388889 0.75 0.8611111 0.7311111
#  7:  7 0.4333333 0.5833333 0.6866667 0.78 0.8933333 0.7733333
#  8:  8 0.4722222 0.6222222 0.7344444 0.81 0.9255556 0.8155556
#  9:  9 0.5111111 0.6611111 0.7822222 0.84 0.9577778 0.8577778
# 10: 10 0.5500000 0.7000000 0.8300000 0.87 0.9900000 0.9000000

The join of the form x[i] is performed on the key column id. For each row of data2's id column, the corresponding matching rows in data1 is found. For example, for id = 2 from data2, the matching row is the 2nd row in data1.

Once we've all the matching rows, we evaluate the expression in j, which updates the data1 columns provided in col1 by adding the values from mget(cols1) and mget(cols2).

cols2 was generated with a i. prefix, which fetches the values from the data.table i -- here data2.

HTH

complex data.table subset and vectorised maniulation

3 Answers

Benchmarks