I am trying to calculate the Spearman correlation and p-value for a data frame. For better p-value approxiamation, I must stick to the pspearman package. I am expecting a result similar with the rcorr()
function. But I have a problem when performing pspearman:spearman.test()
row by row.
My dataframe contains 5000 rows (genes), and 200 columns(spots). And I want to get a correlation matrix and p-value matrix for these 5000*5000 gene-gene pairs. The correlation is only calculated when both two genes are not NAs in more than two spots.
I can achieve this with loops but it is too slow for my big dataset. I have problems when I try to use apply(),sapply(),mapply()
to improve the speed.
This is what I've tried:
data = data.frame(matrix(rbinom(10*100000, 50, .5), ncol=200))
dim(data) #5000, 200
rownames(data) <- paste("gene", 1:5000, sep="")
colnames(data) <- paste("spot",1:200,sep='')
library(pspearman)
spearFunc = function(x,y=data) {
df = rbind(x,y)
# Check the number of complete spots.There are no NAs in this set.
complete = sum(!(is.na(x)) & !(is.na(y)))
if (complete >=2 ) {
pspearman::spearman.test(as.numeric(x),as.numeric(y))
# This function returns a list containing 8 values, like pvalue,correlation
}}
pair.all1 = mapply(spearFunc,data,data)
dim(pair.all1)
# 8 200, 200 is the number of columns
pair.all2 = apply(data,1,spearFunc)
Which results in error:
Error in pspearman::spearman.test(as.numeric(x), as.numeric(y)) : (list) object cannot be coerced to type 'double'
I hope to use spearman.test for every gene pair with apply() to do
spearman.test(data[gene1],data[gene1])
spearman.test(data[gene1],data[gene2])
....
spearman.test(data[gene1],data[gene5000])
...
spearman.test(data[gene5000],data[gene5000])
It should return a dataframe of 8 rows and 25,000,000 columns(5000*5000 gene pairs).
Is it possible to use apply() inside apply() to achieve my purpose?
Thx!