Calculating all-against-all gene correlations in R: what would be the best approach?

Question

I have 14000 genes (column: Gene) and 200 samples (columns: sample1, sample2, ...).

I am trying to calculate all-against-all correlations for ~14000 genes, combine each pair's correlations with the required columns from the dataset (test_df) in a new data frame (df1), and write the results to a text file.

When I run the code, I get the correlations for (Gene1, Gene2) and (Gene1, Gene3). When the loop reaches Gene2, it breaks with the error:

Error in cor.test.default(as.matrix(test_df[i, ][, 3:length(test_df)]), : not enough finite observations

Every row has at least 3 to 4 values, so this shouldn't happen.

Please suggest an efficient way of doing this, since I have to compute correlations for 14000 genes. How can I run this code on multiple cores to get results faster?

The code and the resulting file are below.

Thanks in advance

> test_df <- data.frame(ID=c("ID_3721", "ID_537", "ID_555"), 
                      Gene=c("Gene1","Gene2","Gene3"),
                      sample1=c(11397,78191,44838),
                      sample2=c(33768,33763,7680),
                      sample3=c(74521,33268,72367),
                      sample4=c(51486,11435,28772),
                      sample5=c(73539,21486,0))

> test_df
##       ID  Gene sample1 sample2 sample3 sample4 sample5
##1 ID_3721 Gene1   11397   33768   74521   51486   73539
##2  ID_537 Gene2   78191   33763   33268   11435   21486
##3  ID_555 Gene3   44838    7680   72367   28772       0

for(i in 1:2){
       for(j in i+1:3){

          p.cor <- cor.test(as.matrix(test_df[i,][,3:length(test_df)]), as.matrix(test_df[j,][,3:length(test_df)]), method="pearson")$estimate
          s.cor <- cor.test(as.matrix(test_df[i,][,3:length(test_df)]), as.matrix(test_df[j,][,3:length(test_df)]), method="spearman")$estimate

          df1 <- data.frame(ID1   = test_df[i,1],
                            ID2   = test_df[j,1],
                            Name1 = test_df[i,2],
                            Name2 = test_df[j,2],
                            correlation.p = p.cor,
                            correlation.s = s.cor)

         write.table(df1, file="genecorr.txt", row.names=FALSE, sep="\t", append=TRUE, quote=FALSE, col.names = !file.exists("genecorr.txt"))

   }
}

**Error in cor.test.default(as.matrix(test_df[i, ][, 3:length(test_df)]),  : 
  not enough finite observations**

genecorr.txt

ID1     ID2     NAME1   NAME2    correlation.p      correlation.s
ID_3721 ID_537  Gene1   Gene2    -0.136733508500744  -0.1
ID_3721 ID_555  Gene1   Gene3    0.145998550191942    0.3

You don't need cor.test if you are only interested in the estimates; cor is probably enough, and faster. — DJJ
You are writing to genecorr.txt at each iteration, which can lengthen the computation a lot. I would suggest saving once, after all the correlations are computed. — DJJ

DJJ · Accepted Answer · 2020-03-27T16:59:46

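I would suggest first reshaping your data so that each gene becomes a column:

library(data.table)

## convert to a data.table first so that data.table's melt() and dcast() methods are used
dt <- dcast(melt(as.data.table(test_df), id.vars=c("ID","Gene")), variable~Gene, value.var="value")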

## > dt
##    variable Gene1 Gene2 Gene3
## 1:  sample1 11397 78191 44838
## 2:  sample2 33768 33763  7680
## 3:  sample3 74521 33268 72367
## 4:  sample4 51486 11435 28772
## 5:  sample5 73539 21486     0

## all unordered pairs of gene names, one pair per column
nameidx <- combn(names(dt)[-1],2)
 ## > nameidx
 ##      [,1]    [,2]    [,3]   
 ## [1,] "Gene1" "Gene1" "Gene2"
 ## [2,] "Gene2" "Gene3" "Gene3"

Notice how easy it is to produce the name index with the function combn. This lets you avoid the double loop. You can use the ID instead of the name if the names are not unique.
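As a side note on why the double loop in the question stops early: in R the `:` operator binds more tightly than `+`, so `i+1:3` is read as `i + (1:3)`. For `i = 1` that yields 2, 3, 4, and row 4 does not exist in a three-row data frame. The sketch below illustrates this on toy values; `n_genes`, `j_wrong`, `j_right` and `toy` are names made up for the illustration.

```r
n_genes <- 3                     # number of rows in the toy test_df

i <- 1
j_wrong <- i + 1:n_genes         # parsed as i + (1:n_genes)
j_right <- (i + 1):n_genes       # parentheses give the intended range

print(j_wrong)                   # 2 3 4
print(j_right)                   # 2 3

# Indexing a data frame past its last row returns a row of NA values,
# which leaves cor.test() with no finite observations to work on.
toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
print(toy[4, ])
```

Writing the inner loop as `for(j in (i+1):3)` should keep `j` within the existing rows.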

Now it is just a matter of iterating over the columns of nameidx:

## for each gene pair (one column of nameidx), compute both estimates
res <- dt[, lapply(seq_len(ncol(nameidx)), function(x) {
    c(pearson  = cor.test(get(nameidx[1,x]), get(nameidx[2,x]),
                          method="pearson")$estimate,
      spearman = cor.test(get(nameidx[1,x]), get(nameidx[2,x]),
                          method="spearman")$estimate)
})]

## > res
##            V1        V2        V3
## 1: -0.7411691 0.0394641 0.3444608
## 2: -0.6000000 0.1000000 0.3000000

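Finally, combine the gene pairs with their estimates:

 ## cbind() yields a character matrix, so pearson and spearman are stored as character columns
 res1 <- setnames(data.table(cbind(t(nameidx),t(res))),c("Name1","Name2","pearson","spearman"))[]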
 ## > res1
 ##    Name1 Name2            pearson spearman
 ## 1: Gene1 Gene2 -0.741169112323627     -0.6
 ## 2: Gene1 Gene3 0.0394640960151169      0.1
 ## 3: Gene2 Gene3  0.344460833012615      0.3
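If only the estimates are needed, the comment above about using cor instead of cor.test suggests a fully vectorized alternative: call `cor()` once on a samples-by-genes matrix and read each pair from the upper triangle. The following is a minimal sketch on the toy data, not a tested solution for the full 14000-gene set; `mat`, `pair_idx`, `res_all` and `out_file` are illustrative names, and the file is written to a temporary directory only for the sake of the example.

```r
test_df <- data.frame(ID = c("ID_3721", "ID_537", "ID_555"),
                      Gene = c("Gene1", "Gene2", "Gene3"),
                      sample1 = c(11397, 78191, 44838),
                      sample2 = c(33768, 33763, 7680),
                      sample3 = c(74521, 33268, 72367),
                      sample4 = c(51486, 11435, 28772),
                      sample5 = c(73539, 21486, 0))

# Samples in rows, genes in columns: cor() correlates columns pairwise
mat <- t(as.matrix(test_df[, -(1:2)]))

p <- cor(mat, method = "pearson")
s <- cor(mat, method = "spearman")

# Keep each unordered pair once (upper triangle, diagonal excluded)
pair_idx <- which(upper.tri(p), arr.ind = TRUE)
i <- pair_idx[, "row"]
j <- pair_idx[, "col"]

res_all <- data.frame(ID1   = test_df$ID[i],
                      ID2   = test_df$ID[j],
                      Name1 = test_df$Gene[i],
                      Name2 = test_df$Gene[j],
                      correlation.p = p[pair_idx],
                      correlation.s = s[pair_idx])

# Write once, after all correlations are computed
out_file <- file.path(tempdir(), "genecorr.txt")
write.table(res_all, file = out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

print(res_all)
```

For 14000 genes each correlation matrix holds 14000 x 14000 doubles, roughly 1.5 GB, and the long table has about 98 million rows, so memory is likely to matter more than CPU time. A single vectorized `cor()` call may also make explicit multi-core code unnecessary at this size. If the real data contain missing values, `cor()` accepts `use = "pairwise.complete.obs"`.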
