Apply a function across groups and columns in data.table and/or dplyr

Question

I would like to combine two data.tables or dataframes of unequal row #, where the # of rows of dt2 is the same as the number of groups of dt1. Here is a reproducible example:

a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)

The real case involves a large number of columns so I with to refer to them with variables. Using either a loop or apply, I wish to add each row of dt2 to the rows comprising each group of dt1. Here is one of many attempts that fail:

for (ic in 1:3) {
  c1 <- dt2[,(ic), with=FALSE]
  c2 <- dt2[,(ic), with=FALSE]
  dt1[,(ic) := .(c1 + c2[.G]), by = "groupVar"]
}

I am interested in how to do this kind of operation "by group and by column" in both data.table syntax and dplyr syntax. In place (as above) is not critical.

desired result:

dt1 (or dt3) = 
a   b   c   groupVar
11  22  33  1
12  23  34  1
13  24  35  1
24  35  46  2 
...
40  51  62  3

see github.com/Rdatatable/data.table/pull/4304, which is not yet released but will allow easier specification of arbitrary columns in j statements. — Michael

Uwe Uwe · Accepted Answer · 2021-03-14T09:16:07

The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.

Here are two approaches which should be working for an arbitrary number of arbitrarily named pairs of columns:

Working in long format
EDIT: Update joins using get()
EDIT 2: Computing on the language

1. Working in long format

The information on corresponding columns can be provided in a look-up table or translation table:

library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

lut

   vars1 vars2
1:     a    a2
2:     b    b2
3:     c    c2

In cases where column names are treated as data and the column data are of the same data type my first approach is to reshape to long format.

# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables 
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape backe to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3

    groupVar  a  b  c
 1:        1 11 22 33
 2:        1 12 23 34
 3:        1 13 24 35
 4:        2 24 35 46
 5:        2 25 36 47
 6:        2 26 37 48
 7:        3 37 48 59
 8:        3 38 49 60
 9:        3 39 50 61
10:        3 40 51 62

2. Update joins using `get()`

Giving a second thought, here is an approach which is similar to OP's proposed for loop and requires much less coding:

vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
   
for (iv in seq_along(vars1)) {
  dt1[dt2, on = .(groupVar), 
      (vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}

dt1[]

     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

Note that dt1 is updated by reference, i.e., without copying.

Prepending the variable names vars1[iv] by "x." and vars2[iv] by "i." on the right hand side of := is to ensure that the right columns from dt1 and dt2, resp., are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").

3. Computing on the language

This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.

library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>% 
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>% 
  glue_collapse( sep = ", ") %>% 
  {glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>% 
  EVAL()

     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement

dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]

as a character string. This string is then evaluated and executed in one go; no for loops required.

As the helper function EVAL() already uses paste0() the call to glue() can be omitted:

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>% 
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>% 
  glue_collapse( sep = ", ") %>% 
  {EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}

Note that dot . and curly brackets {} are used with different meaning in different contexts which may appear somewhat confusing.

Apply a function across groups and columns in data.table and/or dplyr

4 Answers

1. Working in long format

2. Update joins using get()

3. Computing on the language

2. Update joins using `get()`