R Data Frame Filter Not Working

Question

I am trying to filter the output from RNA-seq data analysis. I want to generate a list of genes that fit the specified criteria in at least one experimental condition (dataframe).

For example, the data is output as a .csv, so I read in the whole directory, as follows.

readList = list.files("~/Path/To/File/", pattern = "*.csv")
files = lapply(readList, read.csv, row.names = 1)   
#row.names = 1 sets rownames as gene names

This reads in 3 .csv files, A, B and C. The data look like this

A = files[[1]]
B = files[[2]]
C = files[[3]]
head(A)
            logFC    logCPM       LR            PValue           FDR
YER037W -1.943616  6.294092 34.30835 0.000000004703583 0.00002276064
YJL184W -1.771273  5.840774 31.97088 0.000000015650144 0.00003786552
YFR053C  1.990102 10.107793 30.55576 0.000000032440747 0.00005232692
YDR342C  2.096877  6.534761 28.08635 0.000000116021451 0.00014035695
YGL062W  1.649138  8.940714 23.32097 0.000001370968319 0.00132682314
YFR044C  1.992810  9.302504 22.91553 0.000001692786468 0.00132736130

I then try to filter all of these to generate a list of genes (rownames) where two conditions must be met in at least one dataset.

1.logFC > 1 or < -1

2.FDR < 0.05

So I loop through the dataframes like so

genesKeep = ""
for (i in 1:length(files) {
F = data.frame(files[i])
sigGenes = rownames(F[F$FDR<0.05 & abs(F$logFC>1), ])
genesKeep = append(genesKeep, values = sigGenes)

}

This gives me a list of genes, however, when I sanity check these against the data some of the genes listed do not pass these thresholds, whilst other genes that do pass these thresholds are not present in the list.

e.g.

df = cbind(A,B,C)
genesKeep = unique(genesKeep)
logicTest = rownames(df) %in% genesKeep
dfLogic = cbind(df, logicTest)

whilst the majority of genes do infact pass the criteria I set, I see some discrepancies for a few genes. For example

          A.logFC    A.FDR     B.logFC    B.FDR     C.logFC    C.FDR   logicTest
YGR181W -0.8050325 0.1462688 -0.6834184 0.2162317 -1.1923744 0.04049870 FALSE
YOR185C  0.8321432 0.1462919  0.7401477 0.2191413 -0.9616989 0.04098177 TRUE

The first gene (YGR181W) passes the criteria in condition C, where logFC < -1 and FDR < 0.05. However, the gene is not reported in the genesKeep list.

Conversely, the second gene (YOR185C) does not pass these criteria in any condition, but the gene is present in the genesKeep list.

I'm unsure where I'm going wrong here, but if anyone has any ideas they would be much appreciated.

Thanks.

You may want to try sorting by row names, or merging as opposed to cbinding because there may be an issue with paralleling row.names between datasets — akash87
@akash87 you're right! Something so simple has been holding me up all day. Thank you so much for your suggestion! — Christian Bates

Christian Bates Christian Bates · Accepted Answer · 2017-05-22T20:07:34

Using merge as suggested by akash87 solved the problem.

Turns out cbind was causing the rownames to not be assigned correctly.

R Data Frame Filter Not Working

2 Answers