I'm using the R package 'mgcv' to create GAMs for predicting seabird distributions. I'm having some problems interpreting the results from gam.check() for some of my models (one model per species).
I'm using a negative binomial family and log link, with an offset of log of the sample area. I have 13 possible predictor variables. My response data is discrete count data from surveys, it is highly zero-inflated.
In each candidate I check for concurvity (using 0.8 as a threshold in the entire model), and that my model terms are significant using summary(),(otherwise they are dropped), and the model converges each time. Despite this, I am still getting highly significant results from gam.check(). This is in spite of the edf being much lower than k. I have tried increasing k a lot but it doesn't make a difference to the results nor often to the edf, I have tried using a tweedie and zero-inflated poisson (ziP) families too but that also doesn't make much of a difference. I will say, however, that my QQ plots do not look fantastic.
Does anyone know why this might occur? I assume the model is unstable and should not be used, however I have no idea how to fix it. It's happening for almost all of the species too, so it's not specific to one file.
Here is my model
gam_nb <- gam(COUNT_SUM ~ s(X_MEAN_2, k=25) + s(Currents_Northward, k=25),
data=my_data,family = "nb", select=TRUE, offset=LOG_AREA_SUM, method = "REML")
And here are the diagnostic results
Method: REML Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-0.002334867,0.0005403022]
(score 1783.03 & scale 1).
Hessian positive definite, eigenvalue range [1.846342e-06,121.4505].
Model rank = 49 / 49
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(X_MEAN_2) 24.0 11.4 0.66 <2e-16 ***
s(Currents_Northward) 24.0 11.8 0.69 0.025 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now see what happens when I increase k
gam_nb <- gam(COUNT_SUM ~ s(X_MEAN_2, k=100) + s(Currents_Northward, k=100),
data=my_data,family = "nb", select=TRUE, offset=LOG_AREA_SUM, method = "REML")
Method: REML Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-6.715822e-05,1.121576e-05]
(score 1783.004 & scale 1).
Hessian positive definite, eigenvalue range [7.32924e-06,120.848].
Model rank = 199 / 199
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(X_MEAN_2) 99.0 12.0 0.67 <2e-16 ***
s(Currents_Northward) 99.0 13.1 0.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1