
I have a question about how the significance of my results changes in my glm() model (which includes an interaction) when I use the family = binomial argument versus when I leave it out. I am very new to logistic regression and have only done simpler linear regression in the past.

I have a dataset of observations of tree growth rings, with two categorical explanatory variables (Treatment and Origin). The Treatment variable is an experimental drought treatment with four levels (Control, First Drought, Second Drought, and Two Droughts). The Origin variable has three levels and refers to the tree's origin, coded by color as Red, Yellow, and Blue. My observations are whether a growth ring is present or not (1 = growth ring present, 0 = no growth ring).

In my case, I am interested in the effect of Treatment, the effect of Origin, and also the possible interaction of Treatment and Origin.

It has been suggested that binomial logistic regression would be a good method for analyzing this data set. (Hopefully that is appropriate? Maybe there are better methods?)

I have n = 5, i.e. five observations for each Treatment-by-Origin combination (for example, five observations of growth rings for the Control/Blue trees, five for the Control/Yellow trees, and so on). With 4 treatments × 3 origins × 5 observations, there are 60 observations of growth rings in total.

In R, I've used the glm() function, set up as follows:

growthring_model <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin, data = growthringdata, family = binomial(link = "logit"))

I've factored my explanatory variables so that the Control treatment and the Blue origin trees are my reference.
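For reference, the releveling looks roughly like this (a sketch, not my exact code):

# Set Control and Blue as the reference levels
growthringdata$Treatment <- relevel(factor(growthringdata$Treatment), ref = "Control")
growthringdata$Origin    <- relevel(factor(growthringdata$Origin), ref = "Blue")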

What I notice is that when I leave the family = binomial argument out of the code, I get p-values that I would reasonably expect given the data. However, when I add family = binomial, the p-values are all 1 or very close to 1 (e.g. 1, 0.99, 0.98). This seems odd: I could accept that the effects are not significant, but the fact that the values are ALL so near 1 makes me suspicious given my actual data. Without family = binomial, the p-values seem to make more sense (even though they are still relatively high/non-significant).

Can someone help me understand why the binomial argument shifts my results so much? (I understand that it refers to the distribution, i.e. my observations are either 1 or 0.) What exactly does it change in the model? Is this a result of low sample size? Is there something wrong in my code? Or maybe those very high p-values are correct after all?

Here is the readout of my model summary with the binomial argument present:

Call: glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, family = binomial(link = "logit"), data = growthringdata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.79412  -0.00005  -0.00005  -0.00005   1.79412  

Coefficients:
                                       Estimate Std. Error z value Pr(>|z|)
(Intercept)                          -2.057e+01  7.929e+03  -0.003    0.998
TreatmentFirst Drought               -9.931e-11  1.121e+04   0.000    1.000
TreatmentSecond Drought               1.918e+01  7.929e+03   0.002    0.998
TreatmentTwo Droughts                -1.085e-10  1.121e+04   0.000    1.000
OriginYellow                          1.918e+01  7.929e+03   0.002    0.998
OriginRed                            -1.045e-10  1.121e+04   0.000    1.000
TreatmentFirst Drought:OriginYellow  -1.918e+01  1.373e+04  -0.001    0.999
TreatmentSecond Drought:OriginYellow -1.739e+01  7.929e+03  -0.002    0.998
TreatmentTwo Droughts:OriginYellow   -1.918e+01  1.373e+04  -0.001    0.999
TreatmentFirst Drought:OriginRed      1.038e-10  1.586e+04   0.000    1.000
TreatmentSecond Drought:OriginRed     2.773e+00  1.121e+04   0.000    1.000
TreatmentTwo Droughts:OriginRed       2.016e+01  1.373e+04   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57.169  on 59  degrees of freedom
Residual deviance: 28.472  on 48  degrees of freedom
AIC: 52.472

Number of Fisher Scoring iterations: 19

And here is the readout of my model summary without the binomial argument:

Call: glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, data = growthringdata)

Deviance Residuals: 
Min      1Q  Median      3Q     Max  
-0.8     0.0     0.0     0.0     0.8  

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)                          -4.278e-17  1.414e-01   0.000           1.0000  
TreatmentFirst Drought                3.145e-16  2.000e-01   0.000   1.0000  
TreatmentSecond Drought               2.000e-01  2.000e-01   1.000   0.3223  
TreatmentTwo Droughts                 1.152e-16  2.000e-01   0.000   1.0000  
OriginYellow                          2.000e-01  2.000e-01   1.000   0.3223  
OriginRed                             6.879e-17  2.000e-01   0.000   1.0000  
TreatmentFirst Drought:OriginYellow  -2.000e-01  2.828e-01  -0.707   0.4829  
TreatmentSecond Drought:OriginYellow  2.000e-01  2.828e-01   0.707   0.4829  
TreatmentTwo Droughts:OriginYellow   -2.000e-01  2.828e-01  -0.707   0.4829  
TreatmentFirst Drought:OriginRed     -3.243e-16  2.828e-01   0.000   1.0000  
TreatmentSecond Drought:OriginRed     6.000e-01  2.828e-01   2.121   0.0391 *
TreatmentTwo Droughts:OriginRed       4.000e-01  2.828e-01   1.414   0.1638  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1)

    Null deviance: 8.9833  on 59  degrees of freedom
Residual deviance: 4.8000  on 48  degrees of freedom
AIC: 44.729

Number of Fisher Scoring iterations: 2

(I apologize in advance for the possible simplicity of my question. I've tried to read up on logistic regression and to follow some examples, but I have struggled to find answers addressing my particular situation.)

Thanks so much.

This isn't a programming question. For statistics help, go to stats.stackexchange.com. – Gregor Thomas
Also make sure you are looking at the correct columns; in your second block of pasted output the formatting seems a bit off. It is still the 4th column of numbers that holds the p-values, and only one of them is less than 0.05. I would suggest looking at some model predictions on dummy data (make sure you use type = "response" when running predict on the binomial model). I think that will help you make up your mind about which model makes sense. – Gregor Thomas
Hi Gregor. Ah, sorry and thanks so much, I will switch this to stats.stackexchange. Yes, I think you are also right about the formatting. I will try out some model predictions as you suggest. – jamos0173

1 Answer


In line with Gregor's comment above, one could also read this as a programming question: if you leave out family = binomial, glm() uses the default family = gaussian, which implies an identity link function and assumes normal, homoscedastic errors. See also ?glm.
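To illustrate (a sketch reusing the variable names from the question, not the original analysis), the first three calls below fit exactly the same ordinary least squares model; only the last one is a logistic regression:

# Without a family argument, glm() falls back to family = gaussian
# (identity link), which gives the same fit as lm().
m_default  <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin,
                  data = growthringdata)
m_gaussian <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin,
                  data = growthringdata, family = gaussian)
m_lm       <- lm(growthringobs ~ Treatment + Origin + Treatment:Origin,
                 data = growthringdata)

# Only this call models the 0/1 response on the log-odds (logit) scale:
m_logit <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin,
               data = growthringdata, family = binomial(link = "logit"))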

With a 0/1 response, the assumption of normal, homoscedastic errors is almost certainly violated. Thus, the standard errors and p-values of the second model shown here are likely incorrect.
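Following Gregor's suggestion in the comments, a useful sanity check is to look at predicted probabilities from the binomial fit (again a sketch, assuming the model object and data frame names from the question):

# Predicted probability of a growth ring for every Treatment x Origin combination;
# type = "response" returns probabilities rather than log-odds.
newdat <- expand.grid(Treatment = levels(factor(growthringdata$Treatment)),
                      Origin    = levels(factor(growthringdata$Origin)))
newdat$prob <- predict(growthring_model, newdata = newdat, type = "response")
newdat

Comparing these predictions with the observed proportion of rings in each Treatment-by-Origin cell should help you decide which model makes sense.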