3
votes

I have some large data sets and am trying out data.table to combine them while summing up the shared column over matching rows. I know how to merge using [ matching rows in the LHS data.table as shown below with tables a2:LHS and a:RHS

a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")

a2
    b   c
 1: a 100
 2: b 100
 3: c 100
 4: d 100
 5: e 100
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

a
    b  c
 1: a  1
 2: b  2
 3: c  3
 4: d  4
 5: e  5
 6: f  6
 7: g  7
 8: h  8
 9: i  9
10: j 10

from second answer hereMerge data frames whilst summing common columns in R I saw how columns could be summed up over matching rows, as such:

setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
    b   c
 1: a 101
 2: b 102
 3: c 103
 4: d 104
 5: e 105
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

However I am trying retain the rows that don't match as well.

Alternately I could use merge as shown below but I would like a void making a new table with 4 rows before reducing it to 2 rows.

c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]

c
    b value
 1: a   102
 2: b   104
 3: c   106
 4: d   108
 5: e   110
 6: f     6
 7: g     7
 8: h     8
 9: i     9
10: j    10
11: k   100
12: l   100
13: m   100
14: n   100
15: o   100

This last table is what I would like to achieve, Thanks in Advance.

1
Or rbindlist(list(a, a2))[, sum(c), b]?A5C1D2H2I1M1N2O1R2T1
Thanks I see they both work. I am trying to understand the `[ , sum..] syntax and look into how much memory they use.Bhail

1 Answers

2
votes

merge is likely to not be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:

rbindlist(list(a, a2))[, sum(c), b]

I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).


Compare some timings on larger datasets:

library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))

system.time(rbindlist(list(a, a2))[, sum(c), b])
#   user  system elapsed 
#   0.83    0.05    0.87 

system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
#   user  system elapsed 
# 159.58    0.48  162.95 

## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166

nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166