R data.table: accessing column with variable name

Question

I am using the wonderful R data.table package. However, accessing (i.e. manipulating by reference) a column with a variable name is very clumsy: If we are given a data.table dt which has two columns x and y and we want to add two columns and name it z then the command is

dt = dt[, z := x + y]

Now let us write a function add that takes as arguments a (reference to a) data.table dt and three column names summand1Name, summand2Name and resultName and it is supossed to execute the exact same command as above only with general column names. The solution I am using right now is reflection, i.e.

add = function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}

However I am absolutely not satisfied with this solution. First of all it's clumsy, it does not make fun to code like this. It is hard to debug and it just pisses me off and burns time. Secondly, it is harder to read and understand. Here is my question:

Can we write this function in a somewhat nicer way?

I am aware of the fact that one can access columns with variable name like so: dt[[resultName]] but when I write

dt[[resultName]] = dt[[summand1Name]] + dt[[summand2Name]]

then data.table starts to complain about having taken copies and not working by reference. I don't want that. Also I like the syntax dt = dt[<all 'database related operations'>] so that everything I am doing is stuck together in one pair of brackets. Isn't it possible to make use of a special symbol like backticks or so in order to indicate that the name currently used is not referencing an actual column of the data table but rather is a placeholder for the name of an actual column?

How about add = function(dt, summand1Name, summand2Name, resultName) dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]? Another option could be add2 = function(dt, summand1Name, summand2Name, resultName) dt[, (resultName) := eval(as.name(summand1Name)) + eval(as.name(summand2Name))] or just use get as suggested above. — David Arenburg

amatsuo_net amatsuo_net · Accepted Answer · 2017-05-24T08:01:54

You can combine the use of () on the LHS of := as well as with = FALSE in referencing a variable on the RHS.

dt <- data.table(a = 1:5, b = 10:14)
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand1Name, with = FALSE]]
}
my_add(dt, 'a', 'b', 'c')
dt

Edit:

Compared three versions. Mine is the most inefficient... (but will keep it just for reference).

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))
original_add <- function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand1Name, with = FALSE]]
}
list_access_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[[summand1Name]] + dt[[summand2Name]]]
}
david_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]
}

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                                expr      min        lq      mean    median        uq      max
##     original_add(dt, "a", "b", "c")  604.397  659.6395  784.2206  713.0315  776.1295 5070.541
##           my_add(dt, "a", "b", "c") 1063.984 1168.6140 1460.5329 1247.7990 1486.9730 6134.959
##  list_access_add(dt, "a", "b", "c")  272.822  310.9680  422.6424  334.3110  380.6885 3620.463
##        david_add(dt, "a", "b", "c")  389.389  431.9080  542.7955  454.5335  493.4895 3696.992
##  neval
##    100
##    100
##    100
##    100

Edit2:

With one million rows, the result looks like this. As expected the original method perform well as once eval is done this will work fast.

## Unit: milliseconds
##                                expr       min        lq      mean    median        uq      max
##     original_add(dt, "a", "b", "c")  2.493553  3.499039  6.585651  3.607101  4.390051 114.0612
##           my_add(dt, "a", "b", "c") 11.821820 14.512878 28.387841 17.412433 19.642231 117.6359
##  list_access_add(dt, "a", "b", "c")  2.161276  3.133110  6.874885  3.218185  3.407776 107.6853
##        david_add(dt, "a", "b", "c")  2.237089  3.313133  6.047832  3.381757  3.788558 103.7532
##  neval
##    100
##    100
##    100
##    100

R data.table: accessing column with variable name

4 Answers