9
votes

I want to efficiently sum the entries of two data frames, though the data frames are not guaranteed to have the same dimensions or column names. Merge isn't really what I'm after here. Instead I want to create an output object with all of the row and column names that belong to either of the added data frames. In each position of that output, I want to use the following logic for the computed value:

  • If a row/column pairing belongs to both input data frames I want the output to include their sum
  • If a row/column pairing belongs to just one input data frame I want to include that value in the output
  • If a row/column pairing does not belong to any input matrix I want to have 0 in that position in the output.

As an example, consider the following input data frames:

df1 = data.frame(x = c(1,2,3), y = c(4,5,6))
rownames(df1) = c("a", "b", "c")
df2 = data.frame(x = c(7,8), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")
> df1
  x y
a 1 4
b 2 5
c 3 6
> df2
  x  z  w 
a 7  9  2
d 8 10  3

I want the final result to be

> df2
   x  y   z  w
a  8  4   9  2
b  2  5   0  0
c  3  6   0  0
d  8  0  10  3

What I've done so far -

bind_rows / bind_cols in dplyr can throw the following: "Error: incompatible number of rows (3, expecting 2)"

I have duplicated column names, so 'merge' isn't working for my purposes either - returns an empty df for some reason.

5

5 Answers

4
votes

Seems like you could merge on the rownames, then take care of the sums and conversion of NA to zero with some additional munging:

library(dplyr)

df.new = df1 %>% add_rownames %>%
  full_join(df2 %>% add_rownames, by="rowname") %>%
  mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
  mutate(x = x.x + x.y) %>%
  select(rowname,x,y,z,w)

Or, with @DavidArenburg's much more elegant and extensible solution:

df.new = df1 %>% add_rownames %>% 
  full_join(df2 %>% add_rownames) %>% 
  group_by(rowname) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

df.new

  rowname     x     y     z     w
1       a     8     4     9     2
2       b     2     5     0     0
3       c     3     6     0     0
4       d     8     0    10     3
3
votes

This seems like some type of a simple merge on common column names (+ row names) and then a simple aggregation, this is how I would tackle this

library(data.table)
merge(setDT(df1, keep.rownames = TRUE), # Convert to data.table + keep rows
      setDT(df2, keep.rownames = TRUE), # Convert to data.table + keep rows
      by = intersect(names(df1), names(df2)), # merge on common column names
      all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn] # Sum all columns by group                   
#    rn x y  z w
# 1:  a 8 4  9 2
# 2:  b 2 5  0 0
# 3:  c 3 6  0 0
# 4:  d 8 0 10 3

Are a pretty straight forward base R solution

df1$rn <- row.names(df1)
df2$rn <- row.names(df2)
res <- merge(df1, df2, all = TRUE)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)
#   x y  z w
# a 8 4  9 2
# b 2 5  0 0
# c 3 6  0 0
# d 8 0 10 3
2
votes

First, I would grab the names of all the rows and columns of the new entity:

(all.rows <- unique(c(row.names(df1), row.names(df2))))
# [1] "a" "b" "c" "d"
(all.cols <- unique(c(names(df1), names(df2))))
# [1] "x" "y" "z" "w"

Then I would construct an output matrix with those rows and column names (with matrix data initialized to all 0s), adding df1 and df2 to the relevant parts of that matrix.

out <- matrix(0, nrow=length(all.rows), ncol=length(all.cols))
rownames(out) <- all.rows
colnames(out) <- all.cols
out[row.names(df1),names(df1)] <- unlist(df1)
out[row.names(df2),names(df2)] <- out[row.names(df2),names(df2)] + unlist(df2)
out
#   x y  z w
# a 8 4  9 2
# b 2 5  0 0
# c 3 6  0 0
# d 8 0 10 3
1
votes

Using xtabs on melted / stacked data frames:

out <- rbind(cbind(rn=rownames(df1),stack(df1)), cbind(rn=rownames(df2),stack(df2)))
as.data.frame.matrix(xtabs(values ~ rn + ind, data=out))

#  x y w  z
#a 8 4 2  9
#b 2 5 0  0
#c 3 6 0  0
#d 8 0 3 10
0
votes

I’m not convinced the accepted (or alternative merge) method is the best. It will give incorrect results if you have common rows, they’ll get joined and not summed.

This can be shown trivialy by changing df2 to:

df2 = data.frame(x = c(1,2), y = c(4,5), z = c(9,10), w = c(2, 3))
rownames(df2) = c("a", "d")

expected results:

   rn x y  z w
1:  a 2 8  9 2
2:  b 2 5  0 0
3:  c 3 6  0 0
4:  d 2 5 10 3

actual results

merge(setDT(df1, keep.rownames = TRUE), 
  setDT(df2, keep.rownames = TRUE), 
  by = intersect(names(df1), names(df2)), 
  all = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = rn]

   rn x y  z w
1:  a 1 4  9 2
2:  b 2 5  0 0
3:  c 3 6  0 0
4:  d 2 5 10 3

You need to combine both the outer join with an inner join (or left/right joins, merge all=T/all=F). Or alternatively using plyr’s rbind.fill :

base R solution

res <- rbind.fill(df1,df2)
rowsum(res[setdiff(names(res), "rn")], res[, "rn"], na.rm = TRUE)

data table solution

as.data.table(rbind.fill(
  setDT(df1, keep.rownames = TRUE),
  setDT(df2, keep.rownames = TRUE)
))[, lapply(.SD, sum, na.rm = TRUE), by = rn]

I prefer the rbind.fill method as you can "merge" > 2 data frames using the same syntax.