
I'm performing a pairwise Wilcoxon test on a simple data set and getting surprising results. Comparing groups A and C using the full set (A, B and C) returns a p-value different from the one I get when comparing A and C on a subset containing only those two groups.

dfx <- data.frame(group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
                  sex   = sample(c("M", "F"), size = 29, replace = TRUE),
                  age   = runif(n = 29, min = 18, max = 54))
pairwise.wilcox.test(dfx$age, dfx$group, pool.sd=F, paired=F)

Pairwise comparisons using Wilcoxon rank sum test 
data:  dfx$age and dfx$group 
  A    B   
B 0.55 -   
C 0.13 0.19
P value adjustment method: holm 

dfx.ac<-dfx[which(dfx$group!='B'),]
pairwise.wilcox.test(dfx.ac$age, dfx.ac$group, pool.sd=F, paired=F)

Pairwise comparisons using Wilcoxon rank sum test 
data:  dfx.ac$age and dfx.ac$group 
  A    
C 0.043
P value adjustment method: holm 

Running wilcox.test directly on the individual vectors returns the same p-value as the subset call:

a<-dfx[which(dfx$group=='A'),]$age
c<-dfx[which(dfx$group=='C'),]$age
wilcox.test(a,c)
W = 8, p-value = 0.04262

wilcox.test(dfx.ac$age~dfx.ac$group)
W = 8, p-value = 0.04262

What am I doing wrong with pairwise.wilcox.test(dfx$age, dfx$group, pool.sd=F, paired=F)?

The same discrepancy appears whether I compare a set of 3 groups or 4 groups:

dfx <- data.frame(group = c(rep('A', 8), rep('B', 15), rep('C', 6), rep('D', 9)),
                  sex   = sample(c("M", "F"), size = 38, replace = TRUE),
                  age   = runif(n = 38, min = 18, max = 54))

dfx.nb<-dfx[which(dfx$group!='B'),]

pairwise.wilcox.test(dfx$age,dfx$group, pool.sd=F, paired=F)
  A    B    C   
B 1.00 -    -   
C 0.57 0.62 -   
D 0.56 0.56 1.00

pairwise.wilcox.test(dfx.nb$age,dfx.nb$group, pool.sd=F, paired=F)
  A    C   
C 0.28 -   
D 0.28 0.95

1 Answer


Check ?pairwise.wilcox.test. This function applies a correction for multiple comparisons, which accounts for the difference.
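One way to see this directly is to switch the adjustment off with p.adjust.method = "none"; the A-C entry should then match the plain two-sample wilcox.test from the question. A quick sketch (re-creating the data with a seed, since the question's code doesn't set one, so exact values will differ from the question's):

```r
set.seed(1)  # the question's code has no seed; values here are illustrative
dfx <- data.frame(group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
                  age   = runif(n = 29, min = 18, max = 54))

# Unadjusted pairwise p-values...
pairwise.wilcox.test(dfx$age, dfx$group, p.adjust.method = "none")

# ...the A-C entry now equals the direct two-sample test
wilcox.test(dfx$age[dfx$group == 'A'], dfx$age[dfx$group == 'C'])
```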

Edited to add:

When you make a single comparison, a p-value of P means that, under the null hypothesis of no effect, the probability of obtaining data at least as extreme as what you observed is P. So if we want the probability of a false positive to be below 0.05, we simply check whether p < 0.05 (in statistics, we call the threshold for significance alpha, and we are frequently interested in whether p < alpha = 0.05). But if you compare a large number of groups, the probability of finding at least one p-value below alpha is much greater than alpha!

For example, imagine I do two independent comparisons. If the null hypothesis is true, then each has a probability 0.05 of yielding a p-value less than alpha=0.05. But the chance that at least one of them will yield a p-value less than alpha=0.05 is 1-0.95*0.95 = 0.0975. So we have nearly a 1-in-10 chance of a type I error (if the null hypothesis is correct). If we do 100 comparisons, we can be almost certain that we will commit a type I error if we interpret our p-values uncritically.
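The arithmetic above is easy to verify in R (fwer is just a helper name I'm using here for illustration):

```r
# Family-wise error rate: probability of at least one false positive
# among k independent tests, each run at level alpha
fwer <- function(k, alpha = 0.05) 1 - (1 - alpha)^k

fwer(2)    # 0.0975, as computed above
fwer(100)  # about 0.994: a type I error is almost guaranteed
```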

To avoid this, we can "adjust" the p-values to ensure that we control the type I error rate. The most conservative option, known as the Bonferroni correction, is to divide alpha by the number of comparisons we perform (which is equivalent to multiplying each p-value by the number of comparisons, capping at 1).
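In R, p.adjust performs these corrections directly; a small sketch with made-up raw p-values:

```r
# Three hypothetical raw p-values from three comparisons
p.raw <- c(0.013, 0.040, 0.250)

# Bonferroni: each p-value is multiplied by the number of comparisons (capped at 1)
p.adjust(p.raw, method = "bonferroni")  # 0.039 0.120 0.750
```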

pairwise.wilcox.test automatically adjusts the p-values so that they can be interpreted against alpha = 0.05 without badly inflating the type I error rate. The more comparisons you make, the more the p-values need to be adjusted in order to control the type I error rate. [As an aside, note that the correction is a bit complicated, because the three comparisons are no longer fully independent of one another (you can see this by noting that (a > b) & (b > c) implies a > c). In practice, we usually don't take this non-independence into account when we perform the adjustments.]

The p.adjust.method argument controls how pairwise.wilcox.test adjusts the p-values, and changing it would change your p-values. The default, "holm", implements the Holm-Bonferroni method, which controls the family-wise error rate while being uniformly less conservative than plain Bonferroni.
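This also accounts for the specific numbers in the question: with three comparisons, Holm multiplies the smallest raw p-value by 3, and 0.04262 (the A-C p-value from the two-sample test) looks like the smallest of the three. A back-of-the-envelope check:

```r
# Holm multiplies the smallest of the 3 raw p-values by 3 (capped at 1)
min(0.04262 * 3, 1)  # 0.12786, which the question's full table rounds to 0.13
```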