
I have a matrix of gene names with expression values in different tissues. However, the analyses were performed independently and not all genes are present in all tissues. The gene lists for each tissue were simply pasted below each other. Right now it looks like this:

 GeneName   Tissue A   Tissue B
 Gene A     1          -
 Gene B     1          -
 Gene C     2          -
 Gene A     -          3
 Gene D     -          2

I would like to collapse the duplicated gene names so that I get a matrix like the following:

 GeneName   Tissue A   Tissue B
 Gene A     1          3
 Gene B     1          -
 Gene C     2          -
 Gene D     -          2

Edit: Thanks for the answer. However, I forgot to mention that the gene names are in a column of their own, while the row names are simply the numbers 1 to n. I tried to set the name column as the row names with row.names(mydataframe) <- mydataframe$GeneName, but got the following error message:

 Error in `row.names<-.data.frame`(`*tmp*`, value = c(578L, 510L, 1707L, :
   duplicate 'row.names' are not allowed
 In addition: Warning message:
 non-unique values when setting 'row.names':

As I understand it, I can't use a column with non-unique values as row names, which seems to put me in a catch-22 if I need to name the rows after the gene name column in order to collapse the matrix?

Do you have NA values for those missing slots? - akrun
It would be good to display the matrices/data.frames in your question as you see them in your terminal or RStudio. Why does Gene D become 3 in the output? - Ruthger Righart
@RuthgerRighart I would assume it to be a typo. - akrun
I updated the post. Regarding the warning message, a 'data.frame' can't have duplicate row names. You need to convert it to a matrix (though your original description says it is a matrix). - akrun

1 Answer


Assuming that the missing values are 'NA' and that the 'Tissue.B' value in the output for 'Gene D' is 2, you may use

 res <- rowsum(m1, row.names(m1), na.rm=TRUE)
 is.na(res) <- res==0
 res
 #       Tissue.A Tissue.B
 #Gene A        1        3
 #Gene B        1       NA
 #Gene C        2       NA
 #Gene D       NA        2
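
Regarding the catch-22 raised in the question's edit: a matrix, unlike a data.frame, accepts duplicated row names, so one way to get a matrix like 'm1' from a data.frame is sketched below. This is only an illustration on the example data from this answer; it assumes the gene names sit in the first column and that all remaining columns are numeric.

```r
# Example data: gene names in a column, duplicated genes, NA for missing values
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Drop the name column and convert the rest to a matrix;
# duplicated row names are allowed on a matrix
m1 <- as.matrix(df1[, -1])
rownames(m1) <- df1$GeneName

# Sum the rows that share a gene name, ignoring NA
res <- rowsum(m1, rownames(m1), na.rm = TRUE)

# Cells that summed to 0 had no observed value here; mark them as NA
is.na(res) <- res == 0
res
```

Note that rowsum() also accepts the grouping vector directly, so setting the row names is not strictly required; it is done here only to mirror 'm1'.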

If it is a data.frame with 'GeneName' as column

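 library(dplyr)
 # across() replaces summarise_each()/funs(), which are deprecated (needs dplyr >= 1.0.0)
 df1 %>%
    group_by(GeneName) %>% 
    summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))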
 #    GeneName Tissue.A Tissue.B
 #1   Gene A        1        3
 #2   Gene B        1        0
 #3   Gene C        2        0
 #4   Gene D        0        2

and we can replace the 0 with NA as before.
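
For a data.frame result, the replacement of 0 with NA could look like the sketch below. It uses base R only, so the summed table is built with aggregate() here rather than dplyr; the name 'tissue_cols' is just an illustrative helper, and the approach assumes that a sum of 0 can only come from all-missing values.

```r
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Summed table with 0 where a gene had no value in a tissue
res <- aggregate(. ~ GeneName, df1, sum, na.rm = TRUE, na.action = na.pass)

# Replace 0 with NA in every column except the gene name column
tissue_cols <- setdiff(names(res), "GeneName")
res[tissue_cols] <- lapply(res[tissue_cols], function(x) replace(x, x == 0, NA))
res
```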

Or using aggregate from base R

  aggregate(.~GeneName, df1, sum, na.rm=TRUE, na.action=NULL)
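
One caveat with summing and then turning 0 into NA is that a genuine expression value of 0 would also become NA. If that matters, a possible variation is to return NA only when every value in a group is missing. The helper 'sum_or_na' below is a hypothetical name introduced for this sketch, not part of base R.

```r
df1 <- data.frame(
  GeneName = c("Gene A", "Gene B", "Gene C", "Gene A", "Gene D"),
  Tissue.A = c(1L, 1L, 2L, NA, NA),
  Tissue.B = c(NA, NA, NA, 3L, 2L)
)

# Hypothetical helper: NA if all values are missing, otherwise the sum of the rest
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

# na.pass keeps rows containing NA so the helper sees them
res2 <- aggregate(. ~ GeneName, df1, sum_or_na, na.action = na.pass)
res2
```

With this variant no separate replacement step is needed, and true zeros in the data are preserved.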

data

 m1 <- structure(c(1L, 1L, 2L, NA, NA, NA, NA, NA, 3L, 2L), .Dim = c(5L, 
 2L), .Dimnames = list(c("Gene A", "Gene B", "Gene C", "Gene A", 
"Gene D"), c("Tissue.A", "Tissue.B")))

 df1 <- structure(list(GeneName = c("Gene A", "Gene B", "Gene C",
  "Gene A", 
 "Gene D"), Tissue.A = c(1L, 1L, 2L, NA, NA), Tissue.B = c(NA, 
 NA, NA, 3L, 2L)), .Names = c("GeneName", "Tissue.A", "Tissue.B"
 ), class = "data.frame", row.names = c(NA, -5L))