I want to use column names for an assignment by reference (:=) within a data.table. The function I call performs a calculation per row across several columns. I use the current development version of data.table (v1.9.7), which makes the parameter "with = FALSE" unnecessary.

A running minimal example with explicit variable names is:

library(data.table)
DT <- data.table(a = 1:10, b = seq(2, 20, 2), c = seq(5, 50, 5))
DT[, out := sum(a, b), by = 1:nrow(DT)]

But if I have a lot of columns and I call the function with a single variable containing the (selected) column names, the code fails:

DT  <- data.table(a = 1:10, b = seq(2, 20, 2))
col <- colnames(DT)
# Fails: sum() receives the character vector c("a", "b"), not the columns
DT[, out := sum(col), by = 1:nrow(DT)]

EDIT:

David Arenburg's answer DT[, out := Reduce(`+`, .SD), .SDcols = col] works for this specific case. But I do not really understand how this approach can be applied to another function call. I wrote the following function to test:

myfun <- function(x, y, ...){
   in.tmp1 <- x
   in.tmp2 <- c(y, ...)
   out.tmp <- in.tmp1 + mean(in.tmp2)
   return(out.tmp)
}

Again, with the column names written out explicitly, the following approach works:

DT <- data.table(a = 1:10, b = seq(2, 20, 2), c = seq(5, 50, 5))
DT[, out := myfun(a,b,c), by = 1:nrow(DT)]

But I can't work out a more general solution for a large subset of the data.table specified by column names.

If you are doing by = 1:nrow(DT) you are doing it wrong. I would go with DT[, out := Reduce(`+`, .SD), .SDcols = col] - David Arenburg
Thanks, this does indeed work. But does this also work with other functions (e.g. mean) or user-written functions? I got the idea of by = 1:nrow(DT) from an answer to this question: stackoverflow.com/questions/25431307/…. For my first example, it works as intended. - moremo
Well, like eddi said, it is better to vectorize. It depends on the function. The only case where you would use by = 1:nrow(DT) is when there is absolutely no other choice. Neither R nor data.table was designed to work well by row; both work better by columns/matrices. Again, it depends on your function. Also, if your data set is small, I guess it's not such a big deal to work by row. - David Arenburg
I find this Q&A (and links therein) quite useful when considering row-wise operations: How to do row wise operations on .SD columns in data.table - Henrik
Thank you all, but I still haven't managed to call a function with many parameters within the data.table. I think the problem is the quotes. Following this answer stackoverflow.com/questions/12603890/…, I tried col <- quote(c(b,c)) and DT[, out := myfun(a,eval(col)), by = 1:nrow(DT)]. This works in principle, but I still have to type all the column names (e.g. 500) by hand. Suggestions, anyone? - moremo

1 Answer


Consider the following:

library("data.table")

dt <- data.table(a = 1:5, b = 5:1, c = 1, d = 2, e = 5:1)


myfun <- function(x, y, ...){
  in.tmp1 <- x
  in.tmp2 <- c(y, ...)
  out.tmp <- in.tmp1 + mean(in.tmp2)
  return(out.tmp)
}

my_vars <- c("a", "c", "d")

var_list <- mget(my_vars, envir = as.environment(dt))

names(var_list)[1:2] <- c("x", "y")

dt[, "out" := do.call(myfun, var_list)]

Here we collect an arbitrary set of columns, named in my_vars, into var_list, a list whose elements alias the corresponding columns of dt without copying them. In R, do.call passes the elements of a list as the arguments of a function, but the names of the list elements (here var_list) must match the argument names of the function (myfun has the arguments "x", "y" and "..."; the last accepts elements of any name).
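As a minimal sketch of the name matching that do.call performs (the list values and the names args_named, args_wrong, res_named and res_wrong are made up for illustration and are not taken from dt):

```r
myfun <- function(x, y, ...) {
  x + mean(c(y, ...))
}

# Named elements are matched to the arguments x and y; "d" falls into "..."
args_named <- list(x = 1:3, y = 4, d = 6)
res_named  <- do.call(myfun, args_named)  # same as myfun(x = 1:3, y = 4, d = 6)

# With names that match no formal argument, everything lands in "..."
# and x is left missing, so the call signals an error
args_wrong <- list(a = 1:3, c = 4, d = 6)
res_wrong  <- tryCatch(do.call(myfun, args_wrong),
                       error = function(e) "error")
```

This is why the answer renames the first two elements of var_list to "x" and "y" before calling do.call.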

If you want to rely more on data.table and avoid mget, try:

## so myfun finds the correct columns for args "x" and "y"
setnames(dt, c("a", "c"), c("x", "y"))

my_vars <- c("x", "y", "d")
dt[, "out" := do.call(myfun, .SD), .SDcols = my_vars]

EDIT 2017-02-22: Unnamed list elements are also allowed in do.call; they are then matched to the function's arguments by position.

dt[, "out" := do.call(myfun, unname(as.list(.SD))), .SDcols = my_vars]
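One caveat, offered as a sketch rather than as part of the original answer: without by, myfun receives whole columns, so mean() is taken over all values of the columns passed to y and ..., not per row. In the example above this coincides with the row-wise result only because c and d are constant. If a per-row mean is what you need, one option is to group by row; for this particular myfun, rowMeans gives a vectorised alternative. The table dt2 and the columns out_all, out_row and out_vec are hypothetical names used only here.

```r
library(data.table)

myfun <- function(x, y, ...) {
  x + mean(c(y, ...))
}

dt2 <- data.table(a = 1:5, b = 5:1, e = c(2, 4, 6, 8, 10))
my_vars <- c("a", "b", "e")

# Whole-column call: mean() pools every value of b and e
dt2[, out_all := do.call(myfun, unname(as.list(.SD))), .SDcols = my_vars]

# Row-wise call: one group per row, so mean() sees one value per column
dt2[, out_row := do.call(myfun, unname(as.list(.SD))),
    by = seq_len(nrow(dt2)), .SDcols = my_vars]

# Vectorised equivalent of the row-wise call for this specific function
dt2[, out_vec := a + rowMeans(.SD), .SDcols = c("b", "e")]
```

The grouped version is the most general but also the slowest on large tables, which is the point made in the comments about avoiding by = 1:nrow(DT) where a vectorised form exists.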