1
votes

I have five data frames with gene expression data for different sets of samples. Each data frame has a different number of rows, so their row names (genes) overlap only partly.

Now I want a) to filter the five data frames so that they contain only genes present in all of them and b) to combine the gene expression data for those genes into one data frame.

All I could find so far was merge, but that can only merge two data.frames, so I'd have to use it multiple times. Is there an easier way?

2
Are there differences in the data between the data frames? In other words, do you want to obtain five rows or one row per unique row name? - Sven Hohenstein
I want one row per unique row name, but multiple columns (one for each sample). - Lilith-Elina

2 Answers

5
votes

Merging is not very efficient if you want to exclude row names which are not present in every data frame. Here's a different proposal.

First, three example data frames:

df1 <- data.frame(a = 1:5, b = 1:5, 
                  row.names = letters[1:5]) # letters a to e
df2 <- data.frame(a = 1:5, b = 1:5, 
                  row.names = letters[3:7]) # letters c to g
df3 <- data.frame(a = 1:5, b = 1:5, 
                  row.names = letters[c(1,2,3,5,7)]) # letters a, b, c, e, and g
# row names being present in all data frames: c and e

Put the data frames into a list:

dfList <- list(df1, df2, df3)

Find common row names:

idx <- Reduce(intersect, lapply(dfList, rownames))
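
To make the Reduce step more concrete, here is a minimal sketch of what it does under the hood: intersect is applied pairwise from left to right, and each intermediate result is carried into the next call. The names rn, step1 and step2 are only illustrative; the character vectors mirror the row names of the example data frames above.

```r
# Row names of the three example data frames, written out directly
rn <- list(letters[1:5],               # a to e
           letters[3:7],               # c to g
           letters[c(1, 2, 3, 5, 7)])  # a, b, c, e, g

# Reduce(intersect, rn) unrolled by hand
step1 <- intersect(rn[[1]], rn[[2]])   # common to the first two: c, d, e
step2 <- intersect(step1, rn[[3]])     # common to all three: c, e

# Both approaches give the same result
same <- identical(step2, Reduce(intersect, rn))
```

The same pattern extends to five (or any number of) data frames without writing nested calls.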

Extract data:

df1[idx, ]

  a b
c 3 3
e 5 5

PS. If you want to keep the corresponding rows from all data frames, stacked on top of each other, you could replace the last step, df1[idx, ], with the following command:

do.call(rbind, lapply(dfList, "[", idx, ))
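
Since the question asks for one row per gene and one column per sample, a column-wise variant may be closer to what is needed. The following is only a sketch under a few assumptions: the list names (set1, set2, set3) are made up for illustration, the example values differ per data frame so the origin of each column is visible, and the columns are renamed by hand so that identical column names in different data frames do not clash.

```r
# Example data with distinct values per data frame
df1 <- data.frame(a = 1:5,   b = 1:5,   row.names = letters[1:5])
df2 <- data.frame(a = 6:10,  b = 6:10,  row.names = letters[3:7])
df3 <- data.frame(a = 11:15, b = 11:15, row.names = letters[c(1, 2, 3, 5, 7)])

# Hypothetical set names, used as column prefixes below
dfList <- list(set1 = df1, set2 = df2, set3 = df3)

# Row names present in every data frame
idx <- Reduce(intersect, lapply(dfList, rownames))

# Subset each data frame to the common rows (same order in each)
# and prefix its column names with the set name
subsets <- lapply(names(dfList), function(nm) {
  d <- dfList[[nm]][idx, , drop = FALSE]  # drop = FALSE keeps a data frame
  names(d) <- paste(nm, names(d), sep = ".")
  d
})

# Bind the subsets side by side: one row per common gene
combined <- do.call(cbind, subsets)
```

Because every subset is indexed with the same idx, the rows line up before cbind is called; cbind itself does not match on row names.
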
0
votes

Check out the top answer in this SO post. Put your data frames into a list and apply the following line of code:

Reduce(function(...) merge(..., by = "x"), list.of.dataframes)

You just have to adjust the by argument to name the common column on which the data frames should be merged.
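
In the question the genes are stored as row names rather than in a column, so one possible way to apply this is to copy the row names into a key column first. The sketch below assumes a hypothetical key column called gene and sample columns with distinct names (s1, s2, s3); neither comes from the original post. With its default all = FALSE, merge keeps only keys found in both inputs, so the repeated merge ends up with the genes common to all data frames.

```r
# Example data: one sample column per data frame, genes as row names
df1 <- data.frame(s1 = 1:5,   row.names = letters[1:5])
df2 <- data.frame(s2 = 6:10,  row.names = letters[3:7])
df3 <- data.frame(s3 = 11:15, row.names = letters[c(1, 2, 3, 5, 7)])
list.of.dataframes <- list(df1, df2, df3)

# Copy the row names into an ordinary column ("gene" is a made-up name)
with_key <- lapply(list.of.dataframes, function(d) {
  data.frame(gene = rownames(d), d, stringsAsFactors = FALSE)
})

# Merge pairwise on the key column; only genes present in both sides survive
merged <- Reduce(function(x, y) merge(x, y, by = "gene"), with_key)

# Move the key back into the row names
rownames(merged) <- merged$gene
merged$gene <- NULL
```

If the data frames share column names, merge appends suffixes to tell them apart, so giving the sample columns unique names beforehand keeps the result easier to read.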