
I have a dataset plink.raw and I am testing whether a marker CN00020133 (which has three levels: 0, 1, 2) is associated with phenotype5. I want to compare 0 vs 1 and 2 vs 1 using a GLM or Fisher's exact test:

    table(plink.raw$phenotype5,plink.raw$CN00020133)
              1     0     2
        0  3559     0     7
        1 14806    54   123
  1. Tested using a GLM, I can see that the p-value for 0 vs 1 is 0.912894.

    plink.raw$CN00020133 <- factor(plink.raw$CN00020133, levels=c("1","0","2"))
    univariate=glm(phenotype5 ~ relevel(CN00020133,ref ="1"), family = binomial, data = plink.raw)
    summary(univariate)
    

    Call:

    glm(formula = phenotype5 ~ relevel(CN00020133, ref = "1"), 
        family = binomial, data = plink.raw)
    

    Deviance Residuals:

        Min       1Q   Median       3Q      Max  
    -2.4173   0.6564   0.6564   0.6564   0.6564  
    

    Coefficients:

                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                         1.42555    0.01867  76.361  < 2e-16 ***
    relevel(CN00020133, ref = "1")0  13.14051  120.12616   0.109 0.912894    
    relevel(CN00020133, ref = "1")2   1.44072    0.38902   3.703 0.000213 ***
    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
    Null deviance: 18158  on 18548  degrees of freedom
    Residual deviance: 18114  on 18546  degrees of freedom
    AIC: 18120
    
    Number of Fisher Scoring iterations: 13
    
  2. But if I test it using Fisher's exact test, the p-value for 0 vs 1 is 4.618e-06.

    Convictions <- matrix(c(0, 54, 3559, 14806), nrow = 2,
                          dimnames = list(c("control", "case"),
                                          c("del/dup", "normal_copy")))
    fisher.test(Convictions, alternative = "less")
    
    Fisher's Exact Test for Count Data
    data:  Convictions
    p-value = 9.048e-06
    alternative hypothesis: true odds ratio is less than 1
    95 percent confidence interval:
    0.0000000 0.2374411
    sample estimates:
    odds ratio 
             0 
    

1 Answer


You have a case of complete separation, and the collapse of the Wald statistic it causes is known as the Hauck-Donner effect. Whenever CN00020133 == 0, the phenotype is 1, so the 0 vs 1 coefficient and its standard error cannot be estimated reliably and the Wald p-value from summary() is not meaningful. There's a fair amount of material on this, for example Alexej's blog, a post by Brian Ripley, and Ben Bolker's notes.
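
As a quick check (a minimal sketch, assuming your plink.raw data frame and the model call from your question), the cross-tabulation shows the empty cell that causes the separation, and the fitted coefficients show the tell-tale symptom: a huge estimate with an even larger standard error.

# The (CN00020133 == 0, phenotype5 == 0) cell is empty, which pushes the
# fitted log odds ratio for level "0" towards infinity.
with(plink.raw, table(CN00020133, phenotype5))

# Symptom in the fit: an estimate of ~13 with a standard error of ~120, so the
# Wald z statistic (and its p-value) tells you nothing about significance.
fit <- glm(phenotype5 ~ relevel(CN00020133, ref = "1"),
           family = binomial, data = plink.raw)
summary(fit)$coefficients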

If you need a valid significance test for the 0 vs 1 comparison, one solution is to use the likelihood ratio test:

# Rebuild the 0 vs 1 part of the data from the observed counts
# (3559 controls / 14806 cases for level "1"; 0 controls / 54 cases for level "0")
df <- rbind(
  data.frame(phenotype = rep(0:1, c(3559, 14806)), CN00020133 = "1"),
  data.frame(phenotype = rep(0:1, c(0, 54)), CN00020133 = "0")
)

anova(glm(phenotype ~ CN00020133, data = df, family = binomial), test = "Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: phenotype

Terms added sequentially (first to last)


           Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
NULL                       18418      18082             
CN00020133  1   23.227     18417      18059 1.44e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This gives you a different p-value from the Fisher test because it is based on a different distribution (a chi-square approximation to the binomial likelihood ratio vs. the exact hypergeometric). But either way you can conclude there is an added effect of "1" using "0" as the reference (the likelihood ratio test does not depend on which level is the reference).
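
To make the link explicit (a small check using only the numbers reported above), the p-value in the deviance table is just the upper tail of a chi-square distribution with 1 degree of freedom evaluated at the deviance drop:

# Chi-square upper-tail probability for the deviance drop of 23.227 on 1 df;
# this reproduces the 1.44e-06 printed by anova() above.
pchisq(23.227, df = 1, lower.tail = FALSE)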

There is an implementation of Firth's logistic regression in R; you can try it and see, but I must say I am not very familiar with it:

library("logistf")
logistf(phenotype ~ CN00020133,data=df,family=binomial)
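
A minimal sketch of how you might inspect that fit (again assuming the df built above); Firth's penalization keeps the coefficient for the separated level finite, so the reported confidence interval and p-value remain usable:

library(logistf)

# Penalized (Firth) logistic regression: the penalty prevents the estimate
# for the separated level from diverging to infinity.
fit_firth <- logistf(phenotype ~ CN00020133, data = df)
summary(fit_firth)  # coefficients, confidence intervals and p-values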