1
votes

I am trying to calculate the Spearman correlation and p-value for a data frame. For better p-value approxiamation, I must stick to the pspearman package. I am expecting a result similar with the rcorr() function. But I have a problem when performing pspearman:spearman.test() row by row.

My dataframe contains 5000 rows (genes), and 200 columns(spots). And I want to get a correlation matrix and p-value matrix for these 5000*5000 gene-gene pairs. The correlation is only calculated when both two genes are not NAs in more than two spots.

I can achieve this with loops but it is too slow for my big dataset. I have problems when I try to use apply(),sapply(),mapply() to improve the speed.

This is what I've tried:

data = data.frame(matrix(rbinom(10*100000, 50, .5), ncol=200))
dim(data) #5000, 200
rownames(data) <- paste("gene", 1:5000, sep="") 
colnames(data) <- paste("spot",1:200,sep='')

library(pspearman)
spearFunc = function(x,y=data) {
  df = rbind(x,y)
  # Check the number of complete spots.There are no NAs in this set.
  complete = sum(!(is.na(x)) & !(is.na(y)))
  if (complete >=2 ) {
    pspearman::spearman.test(as.numeric(x),as.numeric(y))
    # This function returns a list containing 8 values, like pvalue,correlation
    }}

pair.all1 = mapply(spearFunc,data,data)
dim(pair.all1)
# 8 200, 200 is the number of columns 
pair.all2 = apply(data,1,spearFunc) 

Which results in error:

Error in pspearman::spearman.test(as.numeric(x), as.numeric(y)) : (list) object cannot be coerced to type 'double'

I hope to use spearman.test for every gene pair with apply() to do

spearman.test(data[gene1],data[gene1]) 
spearman.test(data[gene1],data[gene2])
....
spearman.test(data[gene1],data[gene5000])
...
spearman.test(data[gene5000],data[gene5000])

It should return a dataframe of 8 rows and 25,000,000 columns(5000*5000 gene pairs).

Is it possible to use apply() inside apply() to achieve my purpose?

Thx!

1

1 Answers

0
votes

Consider creating pair-wise combinations of genes from row.names with combn and then iterating through the list of pairs through a defined function. Be sure to return an NA structure from if logic to avoid NULL in matrix output.

However, be forewarned that pair-wise permutations of 5,000 genes (choose(5000, 2)) results very high at 12,497,500 elements! Hence, sapply (a loop itself) may not be that different in performance than for. Look into parallelizing the iteration.

gene_combns <- combn(row.names(data), 2, simplify = FALSE)

spear_func <- function(x) {
  # EXTRACT ROWS BY ROW NAMES  
  row1 <- as.numeric(data[x[1],])
  row2 <- as.numeric(data[x[2],]) 

  # Check the number of complete spots.There are no NAs in this set.
  complete = sum(!(is.na(x)) & !(is.na(y)))

  if (complete >=2 ) {
    pspearman::spearman.test(row1, row2)        
  } else {
    c(statistic=NA, parameter=NA, p.value=NA, estimate=NA, 
      null.value=NA, alternative=NA,   method=NA, data.name=NA)
  }
}

pair.all2 <- sapply(gene_combns, spear_func)

Testing

Above has been tested with cor.test (exactly same input args and output list as spearman.test but more accurate p-value) using a small sample of dataset (50 obs, 20 vars):

set.seed(82418)
data <- data.frame(matrix(rbinom(10*100000, 50, .5), ncol=200))[1:50, 1:20]
rownames(data) <- paste0("gene", 1:50) 
colnames(data) <- paste0("spot", 1:20)

gene_combns <- combn(row.names(data), 2, simplify = FALSE)
# [[1]]
# [1] "gene1" "gene2"    
# [[2]]
# [1] "gene1" "gene3"    
# [[3]]
# [1] "gene1" "gene4"    
# [[4]]
# [1] "gene1" "gene5"    
# [[5]]
# [1] "gene1" "gene6"    
# [[6]]
# [1] "gene1" "gene7"

test <- sapply(gene_combns, spear_func)  # SAME FUNC BUT WITH cor.test
test[,1:5]

#             [,1]                              [,2]                             
# statistic   885.1386                          1659.598                         
# parameter   NULL                              NULL                             
# p.value     0.1494607                         0.2921304                        
# estimate    0.3344823                         -0.2478179                       
# null.value  0                                 0                                
# alternative "two.sided"                       "two.sided"                      
# method      "Spearman's rank correlation rho" "Spearman's rank correlation rho"
# data.name   "row1 and row2"                   "row1 and row2"                  
#             [,3]                              [,4]                             
# statistic   1554.533                          1212.988                         
# parameter   NULL                              NULL                             
# p.value     0.4767667                         0.7122505                        
# estimate    -0.1688217                        0.08797877                       
# null.value  0                                 0                                
# alternative "two.sided"                       "two.sided"                      
# method      "Spearman's rank correlation rho" "Spearman's rank correlation rho"
# data.name   "row1 and row2"                   "row1 and row2"                  
#             [,5]                             
# statistic   1421.707                         
# parameter   NULL                             
# p.value     0.7726922                        
# estimate    -0.06895299                      
# null.value  0                                
# alternative "two.sided"                      
# method      "Spearman's rank correlation rho"
# data.name   "row1 and row2"