0
votes

I am trying to filter the output from RNA-seq data analysis. I want to generate a list of genes that fit the specified criteria in at least one experimental condition (dataframe).

For example, the data is output as a .csv, so I read in the whole directory, as follows.

readList = list.files("~/Path/To/File/", pattern = "*.csv")
files = lapply(readList, read.csv, row.names = 1)   
#row.names = 1 sets rownames as gene names

This reads in 3 .csv files, A, B and C. The data look like this

A = files[[1]]
B = files[[2]]
C = files[[3]]
head(A)
            logFC    logCPM       LR            PValue           FDR
YER037W -1.943616  6.294092 34.30835 0.000000004703583 0.00002276064
YJL184W -1.771273  5.840774 31.97088 0.000000015650144 0.00003786552
YFR053C  1.990102 10.107793 30.55576 0.000000032440747 0.00005232692
YDR342C  2.096877  6.534761 28.08635 0.000000116021451 0.00014035695
YGL062W  1.649138  8.940714 23.32097 0.000001370968319 0.00132682314
YFR044C  1.992810  9.302504 22.91553 0.000001692786468 0.00132736130

I then try to filter all of these to generate a list of genes (rownames) where two conditions must be met in at least one dataset.

1.logFC > 1 or < -1

2.FDR < 0.05

So I loop through the dataframes like so

genesKeep = ""
for (i in 1:length(files) {
F = data.frame(files[i])
sigGenes = rownames(F[F$FDR<0.05 & abs(F$logFC>1), ])
genesKeep = append(genesKeep, values = sigGenes)

}

This gives me a list of genes, however, when I sanity check these against the data some of the genes listed do not pass these thresholds, whilst other genes that do pass these thresholds are not present in the list.

e.g.

df = cbind(A,B,C)
genesKeep = unique(genesKeep)
logicTest = rownames(df) %in% genesKeep
dfLogic = cbind(df, logicTest)

whilst the majority of genes do infact pass the criteria I set, I see some discrepancies for a few genes. For example

          A.logFC    A.FDR     B.logFC    B.FDR     C.logFC    C.FDR   logicTest
YGR181W -0.8050325 0.1462688 -0.6834184 0.2162317 -1.1923744 0.04049870 FALSE
YOR185C  0.8321432 0.1462919  0.7401477 0.2191413 -0.9616989 0.04098177 TRUE

The first gene (YGR181W) passes the criteria in condition C, where logFC < -1 and FDR < 0.05. However, the gene is not reported in the genesKeep list.

Conversely, the second gene (YOR185C) does not pass these criteria in any condition, but the gene is present in the genesKeep list.

I'm unsure where I'm going wrong here, but if anyone has any ideas they would be much appreciated.

Thanks.

2
You may want to try sorting by row names, or merging as opposed to cbinding because there may be an issue with paralleling row.names between datasetsakash87
@akash87 you're right! Something so simple has been holding me up all day. Thank you so much for your suggestion!Christian Bates

2 Answers

1
votes

Using merge as suggested by akash87 solved the problem.

Turns out cbind was causing the rownames to not be assigned correctly.

0
votes

I'm not exactly sure what your desired output is here, but it might be possible to simplify a bit and use the dplyr library to filter all your outputs at once, assuming the format of your data is consistent. Using some modified versions of your data as an example:

A <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C", 
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"), 
    logFC = c(-1.943616, -1.771273, 0, 2.096877, 1.649138, 1.99281
    ), logCPM = c(6.294092, 5.840774, 10.107793, 6.534761, 8.940714, 
    9.302504), LR = c(34.30835, 31.97088, 30.55576, 28.08635, 
    23.32097, 22.91553), PValue = c(4.703583e-09, 1.5650144e-08, 
    3.2440747e-08, 1.16021451e-07, 1.370968319e-06, 1.692786468e-06
    ), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05, 0.00014035695, 
    0.00132682314, 0.06)), .Names = c("gene", "logFC", "logCPM", 
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA, 
-6L))

B <- structure(list(gene = structure(c(2L, 6L, 4L, 1L, 5L, 3L), .Label = c("YDR342C", 
"YER037W", "YFR044C", "YFR053C", "YGL062W", "YJL184W"), class = "factor"), 
    logFC = c(-0.4, -0.3, 0, 2.096877, 1.649138, 1.99281), logCPM = c(6.294092, 
    5.840774, 10.107793, 6.534761, 8.940714, 9.302504), LR = c(34.30835, 
    31.97088, 30.55576, 28.08635, 23.32097, 22.91553), PValue = c(4.703583e-09, 
    1.5650144e-08, 3.2440747e-08, 1.16021451e-07, 1.370968319e-06, 
    1.692786468e-06), FDR = c(2.276064e-05, 3.786552e-05, 5.232692e-05, 
    0.00014035695, 0.1, 0.06)), .Names = c("gene", "logFC", "logCPM", 
"LR", "PValue", "FDR"), class = "data.frame", row.names = c(NA, 
-6L))

Use rbind to create a single dataframe to work with:

AB<- rbind(A,B)

Then filter this whole thing based on your criteria. Note that duplicates can occur, so you can use distinct to only return unique genes that qualify:

filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
  distinct(gene)

     gene
1 YER037W
2 YJL184W
3 YDR342C
4 YGL062W

Or, to keep all the rows for those genes as well:

filter(AB, logFC < -1 | logFC > 1, FDR < 0.05) %>%
  distinct(gene, .keep_all = TRUE)

     gene     logFC   logCPM       LR       PValue          FDR
1 YER037W -1.943616 6.294092 34.30835 4.703583e-09 2.276064e-05
2 YJL184W -1.771273 5.840774 31.97088 1.565014e-08 3.786552e-05
3 YDR342C  2.096877 6.534761 28.08635 1.160215e-07 1.403570e-04
4 YGL062W  1.649138 8.940714 23.32097 1.370968e-06 1.326823e-03