I have a question about significance, and how significance changes, when I include an interaction plus the family = binomial argument in my glm() model versus when I leave it out. I am very new to logistic regression; in the past I have only done simpler linear regression.
I have a dataset of observations of tree growth rings, with two categorical explanatory variables (Treatment and Origin). The Treatment variable is an experimental drought treatment with four levels (Control, First Drought, Second Drought, and Two Droughts). The Origin variable has three levels and refers to the tree's origin, coded by color as Red, Yellow, and Blue. My response is whether a growth ring is present or not (1 = growth ring present, 0 = no growth ring).
In my case, I am interested in the effect of Treatment, the effect of Origin, and also the possible interaction of Treatment and Origin.
It has been suggested that binomial logistic regression would be a good method for analyzing this data set. (Hopefully that is appropriate? Maybe there are better methods?)
I have n = 5 observations for each Treatment-by-Origin combination (for example, 5 observations of growth rings for the Control/Blue trees, 5 for the Control/Yellow trees, etc.), so there are 60 observations of growth rings in the dataset in total.
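For reference, the balanced design described above can be sketched in R as follows (the column names and level names here are assumptions based on my description, not my actual data file):

```r
# Hypothetical reconstruction of the design:
# 4 treatments x 3 origins x 5 replicates = 60 rows
growthringdata <- expand.grid(
  Treatment = c("Control", "First Drought", "Second Drought", "Two Droughts"),
  Origin    = c("Blue", "Yellow", "Red"),
  Rep       = 1:5
)
nrow(growthringdata)  # 60
# growthringobs would then be the 0/1 growth-ring indicator for each row
```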
In R, I've used the glm() function, set up as follows:

growthring_model <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin, data = growthringdata, family = binomial(link = "logit"))
I've factored my explanatory variables so that the Control treatment and the Blue origin trees are my reference.
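Concretely, setting those reference levels can be done with relevel(); this is a sketch assuming the factor levels are named as described above:

```r
# Make Control and Blue the reference levels for the dummy coding
growthringdata$Treatment <- relevel(factor(growthringdata$Treatment), ref = "Control")
growthringdata$Origin    <- relevel(factor(growthringdata$Origin), ref = "Blue")
```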
What I notice is that when I leave the family = binomial argument out of the call, I get p-values that seem to make sense given the data (even though they are still relatively high/insignificant). However, when I add family = binomial, the p-values are all 1 or very close to 1 (e.g., 1, 0.98, 0.99). This seems odd: I could accept low significance, but that ALL the values are so near 1 makes me suspicious given my actual data.
Can someone help me understand how the binomial argument is shifting my results so much? (I understand that it refers to the distribution, i.e., my observations are either 1 or 0.) What exactly is it changing in the model? Is this a result of low sample size? Is there something wrong in my code? Or maybe those very high p-values are correct (or not?)?
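For what it's worth, here are the two calls side by side as I understand them (same formula as above; the naming of the two model objects is my own):

```r
# With family omitted, glm() defaults to family = gaussian, i.e. an
# ordinary least-squares fit (the "gaussian family" line in the second
# summary confirms this)
model_gaussian <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin,
                      data = growthringdata)

# With family = binomial, the same formula is fit as a logistic regression
model_binomial <- glm(growthringobs ~ Treatment + Origin + Treatment:Origin,
                      data = growthringdata, family = binomial(link = "logit"))
```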
Here is a readout of my model summary with the binomial argument present:

Call: glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, family = binomial(link = "logit"), data = growthringdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.79412 -0.00005 -0.00005 -0.00005 1.79412
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.057e+01 7.929e+03 -0.003 0.998
TreatmentFirst Drought -9.931e-11 1.121e+04 0.000 1.000
TreatmentSecond Drought 1.918e+01 7.929e+03 0.002 0.998
TreatmentTwo Droughts -1.085e-10 1.121e+04 0.000 1.000
OriginYellow 1.918e+01 7.929e+03 0.002 0.998
OriginRed -1.045e-10 1.121e+04 0.000 1.000
TreatmentFirst Drought:OriginYellow -1.918e+01 1.373e+04 -0.001 0.999
TreatmentSecond Drought:OriginYellow -1.739e+01 7.929e+03 -0.002 0.998
TreatmentTwo Droughts:OriginYellow -1.918e+01 1.373e+04 -0.001 0.999
TreatmentFirst Drought:OriginRed 1.038e-10 1.586e+04 0.000 1.000
TreatmentSecond Drought:OriginRed 2.773e+00 1.121e+04 0.000 1.000
TreatmentTwo Droughts:OriginRed 2.016e+01 1.373e+04 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 57.169 on 59 degrees of freedom
Residual deviance: 28.472 on 48 degrees of freedom
AIC: 52.472
Number of Fisher Scoring iterations: 19
And here is a readout of my model summary without the binomial argument:

Call: glm(formula = Growthring ~ Treatment + Origin + Treatment:Origin, data = growthringdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8 0.0 0.0 0.0 0.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.278e-17 1.414e-01 0.000 1.0000
TreatmentFirst Drought 3.145e-16 2.000e-01 0.000 1.0000
TreatmentSecond Drought 2.000e-01 2.000e-01 1.000 0.3223
TreatmentTwo Droughts 1.152e-16 2.000e-01 0.000 1.0000
OriginYellow 2.000e-01 2.000e-01 1.000 0.3223
OriginRed 6.879e-17 2.000e-01 0.000 1.0000
TreatmentFirst Drought:OriginYellow -2.000e-01 2.828e-01 -0.707 0.4829
TreatmentSecond Drought:OriginYellow 2.000e-01 2.828e-01 0.707 0.4829
TreatmentTwo Droughts:OriginYellow -2.000e-01 2.828e-01 -0.707 0.4829
TreatmentFirst Drought:OriginRed -3.243e-16 2.828e-01 0.000 1.0000
TreatmentSecond Drought:OriginRed 6.000e-01 2.828e-01 2.121 0.0391 *
TreatmentTwo Droughts:OriginRed 4.000e-01 2.828e-01 1.414 0.1638
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1)
Null deviance: 8.9833 on 59 degrees of freedom
Residual deviance: 4.8000 on 48 degrees of freedom
AIC: 44.729
Number of Fisher Scoring iterations: 2
(I apologize in advance for the possible simplicity of my question. I've tried to read up on logistic regression and to follow some examples, but I have struggled to find answers addressing my particular situation.)
Thanks so much.
Comment: [...] type = "response" when running predict() on the binomial model. I think that will help you make up your mind about which model makes sense. – Gregor Thomas
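A minimal sketch of the predict() call suggested in that comment (the model object name is taken from the question):

```r
# Fitted probabilities on the probability scale rather than the
# log-odds scale; type = "response" applies the inverse link
fitted_probs <- predict(growthring_model, type = "response")
head(round(fitted_probs, 3))
```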