12
votes

Suppose I have a data.table

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

I can add new columns like this:

a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
   id  a  b sa sb
1:  1 21 11 43 23
2:  1 22 12 43 23
3:  2 23 13 47 27
4:  2 24 14 47 27
5:  3 25 15 25 15

However, suppose that I have column names instead:

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}

what do I do?

4
...and 1.6 from the data.table FAQ. - joran
@joran: I don't see how I could use eval/quote here; I tried s <- quote(paste0("s..",n) := sum(n)); a[, eval(s), by="id"] and failed. - sds
@sds I was just brainstorming. I sympathize, though. The level of confusion data.table creates for me with this type of problem is the single biggest reason I don't use it more often. - joran

4 Answers

11
votes

You can also do this:

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

a[, c("sa", "sb") := lapply(.SD, sum), by = id]

Or slightly more generally:

cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]
8
votes

This is similar to :

How to generate a linear combination of variables and update table using data.table in a loop call?

but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design design and set() is unlikely to change in that regard.

I sometimes use the EVAL helper at the end of that answer.
https://stackoverflow.com/a/20808573/403310
Some wince at that approach but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head scratching about eval() and quote(). To see the dynamic query that's been constructed (to check it) you can add a print inside your EVAL helper function.

However, in this simple example you can wrap the LHS of := with brackets to tell data.table to lookup the value (clearer than with=FALSE), and the RHS needs a get().

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, (s) := sum(get(n)), by="id"]
}
2
votes

Edit 2020-02-15 about ..

data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases, e.g. dt[ , ..n1] and dt[ , ..n2] in the below


have a look at with in ? data.table:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[, n3 := dt[ , n1, with = FALSE ] * dt[ , n2, with = FALSE ], with = FALSE ]

EDIT:

Or you just change the colnames forth and back:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[ , dt.names["n3"] := 1L, with = FALSE ]

dt.names <- c( n1 = "a", n2 = "b", n3 = "c" )
setnames( dt, dt.names, names(dt.names) )

dt[ , n3 := n1 * n2, by = "id" ]
setnames( dt, names(dt.names), dt.names )

which works together with by.

0
votes

Here is an approach that does the call mangling and avoids any overhead with .SD

# a helper function
makeCall <- function(x,fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply function to)
cols <- c('a','b')
new.cols <- paste0('s',cols)
# create named list of names
name.cols <- setNames(sapply(cols,as.name), new.cols)
# create the call
my_call <-  as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])

#    id  a  b sa sb
# 1:  1 21 11 43 23
# 2:  1 22 12 43 23
# 3:  2 23 13 47 27
# 4:  2 24 14 47 27
# 5:  3 25 15 25 15