2
votes

I am calculating, hopefully right, the hypergeometric test per row in a data frame in R.

Where column 1 is names of genes (microRNAs), the column "Total_mRNAs" is how many mRNA exist in total in the genome so that doesn't change. Column "Total_targets_targets" is how many mRNAs each microRNA can target if all the mRNAs are present. However, for this example only "subset_mRNAs" are present (that number is also always the same) and among these I know how many mRNAs each microRNA can target "subset_targets".

In order to determine if targets for each microRNA are enriched compared to the background (total mRNAs and total microRNAs targeting them) I am performing a hypergeometric test per row like this:

phyper(targets-in-subset, targets-in-bkgd, failure-in-bkgd, sample-size-subset, lower.tail= FALSE)



dput(df1)
structure(list(Genes_names = c("microRNA-1", "microRNA-2", "microRNA-3", 
"microRNA-4", "microRNA-5", "microRNA-6", "microRNA-7", "microRNA-8", 
"microRNA-9", "microRNA-10"), Total_mRNAs = c(61064L, 61064L, 
61064L, 61064L, 61064L, 61064L, 61064L, 61064L, 61064L, 61064L
), Total_targets_targets = c(1918L, 7807L, 3969L, 771L, 2850L, 
1355L, 1560L, 2478L, 1560L, 2478L), subset_mRNAs = c(17571L, 
17571L, 17571L, 17571L, 17571L, 17571L, 17571L, 17571L, 17571L, 
17571L), subset_targets = c(544L, 2109L, 1137L, 213L, 793L, 394L, 
430L, 686L, 430L, 686L)), class = "data.frame", row.names = c(NA, 
-10L))


df1$pvalue <- phyper(df1$subset_targets, df1$Total_targets_targets, df1$Total_mRNAs-df1$Total_targets_targets, df1$subset_mRNAs, lower.tail= FALSE)

Now the question is how can I Bonferroni correct this values? Is this calculation theoretically right?

2
You can use the apply function for this. Please give more details on what you want. From your code listenings I do not know how you want to calculate what exactly.MacOS
Sorry I corrected the question lots. But ill try and explain it a bit better.Amaranta_Remedios
I don't understand where this df1$Total-df1$targets comes from.MacOS
df1$Total = all the mRNAs present in the genome of that specie 61064 df1$targets = all the targets each microRNA can theoretically have if all the mRNAs are present 1918 for the first one.Amaranta_Remedios

2 Answers

3
votes

If you have not a lot of samples, avoid this pain and simply use a fisher test and do bonferroni using p.adjust:

library(broom)
result = lapply(1:nrow(df1),function(i){
       not_target_subset = df1$Total_targets_targets[i] - df1$subset_targets[i]
       not_subset = df1$Total_mRNAs[i] - df1$subset_mRNAs[i] - not_target_subset
       
       
       M = cbind(c(df1$subset_targets[i],df1$subset_mRNAs[i]-df1$subset_targets[i]),
             c(not_target_subset,not_subset))
      
       res = data.frame(Genes_names=df1$Genes_names[i],
                tidy(fisher.test(M,alternative="greater")))

       return(res)
})

result= do.call(rbind,result)
result$padj = p.adjust(result$p.value,"bonferroni")

Your hyper-geometric code is slightly off. And note that you are doing one sided hyper-geometric test.

You can check this post for how to put the tables into phyper and this for why you need the -1 . So we calculate the hypergeometric p-value:

result$hyper_p = with(df1, 
phyper(subset_targets-1,subset_mRNAs,Total_mRNAs-subset_mRNAs, Total_targets_targets, lower.tail= FALSE)
)

And you can see it tallies:

   Genes_names  estimate   p.value  conf.low conf.high
1   microRNA-1 0.9793710 0.6655527 0.8984025       Inf
2   microRNA-2 0.9047305 0.9998968 0.8647701       Inf
3   microRNA-3 0.9933480 0.5791759 0.9350214       Inf
4   microRNA-4 0.9441864 0.7722712 0.8229140       Inf
5   microRNA-5 0.9520878 0.8789562 0.8863785       Inf
6   microRNA-6 1.0151760 0.4119600 0.9168998       Inf
7   microRNA-7 0.9404619 0.8641420 0.8539585       Inf
8   microRNA-8 0.9454359 0.8942082 0.8756678       Inf
9   microRNA-9 0.9404619 0.8641420 0.8539585       Inf
10 microRNA-10 0.9454359 0.8942082 0.8756678       Inf
                               method alternative   hyper_p
1  Fisher's Exact Test for Count Data     greater 0.6655527
2  Fisher's Exact Test for Count Data     greater 0.9998968
3  Fisher's Exact Test for Count Data     greater 0.5791759
4  Fisher's Exact Test for Count Data     greater 0.7722712
5  Fisher's Exact Test for Count Data     greater 0.8789562
6  Fisher's Exact Test for Count Data     greater 0.4119600
7  Fisher's Exact Test for Count Data     greater 0.8641420
8  Fisher's Exact Test for Count Data     greater 0.8942082
9  Fisher's Exact Test for Count Data     greater 0.8641420
10 Fisher's Exact Test for Count Data     greater 0.8942082
2
votes

Warning: The user that asked that question pointed out that the calculations in this answer might be wrong. Please see the comments below.

Based on your edit, it seems that you are looking for is

df1$new.column <- apply(df1,
                        margin = 1,
                        function(row),
                        {
                             return(phyper(row$targets.1, row$targets, sum(row$targets.1, row$targets), row$subset, lower.tail= FALSE))
                        }

EDIT

As pointed out in a comment by StupidWolf, phyper is vectorized. So, you can use (copied from the comment)

with(df1, phyper(targets.1, targets, sum(targets.1, targets), subset, lower.tail= FALSE)

HTH!