3
votes

I would like to combine two data.tables or dataframes of unequal row #, where the # of rows of dt2 is the same as the number of groups of dt1. Here is a reproducible example:

a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)

The real case involves a large number of columns so I with to refer to them with variables. Using either a loop or apply, I wish to add each row of dt2 to the rows comprising each group of dt1. Here is one of many attempts that fail:

for (ic in 1:3) {
  c1 <- dt2[,(ic), with=FALSE]
  c2 <- dt2[,(ic), with=FALSE]
  dt1[,(ic) := .(c1 + c2[.G]), by = "groupVar"]
}

I am interested in how to do this kind of operation "by group and by column" in both data.table syntax and dplyr syntax. In place (as above) is not critical.

desired result:

dt1 (or dt3) = 
a   b   c   groupVar
11  22  33  1
12  23  34  1
13  24  35  1
24  35  46  2 
...
40  51  62  3
4
see github.com/Rdatatable/data.table/pull/4304, which is not yet released but will allow easier specification of arbitrary columns in j statements.Michael

4 Answers

1
votes

The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.

Here are two approaches which should be working for an arbitrary number of arbitrarily named pairs of columns:

  1. Working in long format
  2. EDIT: Update joins using get()
  3. EDIT 2: Computing on the language

1. Working in long format

The information on corresponding columns can be provided in a look-up table or translation table:

library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

lut
   vars1 vars2
1:     a    a2
2:     b    b2
3:     c    c2

In cases where column names are treated as data and the column data are of the same data type my first approach is to reshape to long format.

# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables 
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape backe to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
    groupVar  a  b  c
 1:        1 11 22 33
 2:        1 12 23 34
 3:        1 13 24 35
 4:        2 24 35 46
 5:        2 25 36 47
 6:        2 26 37 48
 7:        3 37 48 59
 8:        3 38 49 60
 9:        3 39 50 61
10:        3 40 51 62

2. Update joins using get()

Giving a second thought, here is an approach which is similar to OP's proposed for loop and requires much less coding:

vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
   
for (iv in seq_along(vars1)) {
  dt1[dt2, on = .(groupVar), 
      (vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}

dt1[]
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

Note that dt1 is updated by reference, i.e., without copying.

Prepending the variable names vars1[iv] by "x." and vars2[iv] by "i." on the right hand side of := is to ensure that the right columns from dt1 and dt2, resp., are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").

3. Computing on the language

This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.

library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>% 
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>% 
  glue_collapse( sep = ", ") %>% 
  {glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>% 
  EVAL()
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement

dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]

as a character string. This string is then evaluated and executed in one go; no for loops required.

As the helper function EVAL() already uses paste0() the call to glue() can be omitted:

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>% 
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>% 
  glue_collapse( sep = ", ") %>% 
  {EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}

Note that dot . and curly brackets {} are used with different meaning in different contexts which may appear somewhat confusing.

1
votes

Assuming that the column names are consistent (e.g. you want a + a2, b + b2...etc), here is a tidyverse solution that starts in a similar way as @dclarson's, then uses the bang-bang operator to select the columns to add up.

Is this what you are after?

## Create tibbles and join
dt1 <- tibble(groupVar,a,b,c)
dt2 <- tibble(groupVar = 1:3,a2,b2,c2)
dt3 <- inner_join(dt1,dt2)

## Define the column starters you are interested in
cols <- c("a","b","c")
## Or in case of many columns
cols <- colnames(dt1[-1])

## Create function to add columns with the same starting letters
add_cols <- function(col){
  dt3 %>% select(starts_with(!!col)) %>% 
    transmute(!!(sym(col)) :=  !!(sym(col)) +  !!(sym(paste0(col,"2")))) 
}
## map the function and add groupVar
 map_dfc(cols,add_cols) %>% mutate(groupVar = dt3$groupVar)

    # A tibble: 10 x 4
       a     b     c groupVar
   <dbl> <dbl> <dbl>    <dbl>
 1    11    22    33        1
 2    12    23    34        1
 3    13    24    35        1
 4    24    35    46        2
 5    25    36    47        2
 6    26    37    48        2
 7    37    48    59        3
 8    38    49    60        3
 9    39    50    61        3
10    40    51    62        3
0
votes

It is simple if you add groupVar to dt2:

dt2 <- data.table(a2, b2, c2, groupVar=1:3)
dt3 <- merge(dt1, dt2)
dt4 <- with(dt3, data.table(a=a+a2, b=b+b2, c=c+c2, groupVar))
dt4
#      a  b  c groupVar
#  1: 11 22 33        1
#  2: 12 23 34        1
#  3: 13 24 35        1
#  4: 24 35 46        2
#  5: 25 36 47        2
#  6: 26 37 48        2
#  7: 37 48 59        3
#  8: 38 49 60        3
#  9: 39 50 61        3
# 10: 40 51 62        3
0
votes

This should solve your desire:

  1. Create a groupVar in dt2 with unique groupVar from dt1
  2. right_join by groupVar
  3. Create new columns a, b, c with mutate
  4. Keep a, b, c and groupVar as desired output
library(dplyr)

dt3 <- dt2 %>% 
  mutate(groupVar = unique(dt1$groupVar)) %>% 
  right_join(dt1, by="groupVar") %>% 
  mutate(a = a + a2,
         b = b + b2,
         c = c + c2) %>% 
  select(a, b, c, groupVar)

data:

library(data.table)
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)

enter image description here