I have a large data matrix (37000 x 2689) with duplicated row names, and I am trying to consolidate the column values (samples) by row name, i.e. sum all rows that share the same gene. I tried summing with the dplyr package, but that did not help. In the example below, the gene column would ideally be the row names, but R does not allow duplicate row names.

gene sampleA sampleB sampleC
aaa    0        0      78
bbb    0        0       1
ccc    0        0      34
aaa    0        10      0
bbb    0        2       0
ccc    0        17      0
aaa    3         0      0
bbb    900       0      0
ccc    6         0      0

1 Answer

Using dplyr, this should be straightforward:

library(dplyr)

set.seed(123)
# tibble() replaces the deprecated data_frame()
df <- tibble(gene = rep(c('aaa', 'bbb', 'ccc'), 3),
             sampleA = rnorm(9), sampleB = rnorm(9), sampleC = rnorm(9))

This gives you:

> head(df)
# A tibble: 6 x 4
  gene  sampleA sampleB sampleC
  <chr>   <dbl>   <dbl>   <dbl>
1 aaa   -0.560   -0.446   0.701
2 bbb   -0.230    1.22   -0.473

Then aggregate with dplyr's group_by and summarise_at functions, which sum each sample column within each gene:

df %>%
  group_by(gene) %>%
  summarise_at(.vars = vars(sampleA, sampleB, sampleC), sum)
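
With 2689 sample columns, listing each one inside vars() is not practical. A minimal sketch of the same aggregation that selects every numeric column instead, assuming dplyr 1.0.0 or later (for across()) and using the example rows from the question; the names counts, summed, mat and summed_mat are made up for illustration. Base R's rowsum() is shown as an alternative that works directly on a matrix and may be worth trying on data of this size:

```r
library(dplyr)

# Example rows copied from the question (illustrative object name)
counts <- tibble(
  gene    = rep(c("aaa", "bbb", "ccc"), 3),
  sampleA = c(0, 0, 0, 0, 0, 0, 3, 900, 6),
  sampleB = c(0, 0, 0, 10, 2, 17, 0, 0, 0),
  sampleC = c(78, 1, 34, 0, 0, 0, 0, 0, 0)
)

# Sum every numeric column within each gene, without naming the columns
summed <- counts %>%
  group_by(gene) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")

# Base R alternative: rowsum() adds up the rows of a matrix by a grouping
# vector, so the duplicated gene names never have to become row names first
mat <- as.matrix(counts[, -1])
summed_mat <- rowsum(mat, group = counts$gene)
```

Both results have one row per gene. In summed_mat the gene names end up as the (now unique) row names, which is the layout the question asks for.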