1
votes

I have a data set I am trying to fit with bam() in the mgcv package. The model has a binary outcome and I need to specify random intercepts for each animal ID. A subset of the data is below (my actual data is much larger with many more covariates):

dat2 <- read.csv('https://github.com/silasbergen/example_data/raw/main/dat2.csv')
dat2$Animal_id <- factor(dat2$Animal_id)
> head(dat2)
  Animal_id DEM_IA Anyrisk
1       105 279.94       0
2       105 278.68       0
3       106 329.13       0
4       106 329.93       0
5       106 332.25       0
6       106 333.52       0
> summary(dat2)
 Animal_id        DEM_IA         Anyrisk      
 105:     2   Min.   :156.3   Min.   :0.0000  
 106: 83252   1st Qu.:246.8   1st Qu.:0.0000  
 107: 22657   Median :290.1   Median :0.0000  
 108:104873   Mean   :284.8   Mean   :0.3619  
 109:142897   3rd Qu.:318.0   3rd Qu.:1.0000  
 110: 53967   Max.   :411.8   Max.   :1.0000 

I want to fit the model and predict for new data without the random effect:

library(mgcv)
mod <- bam(Anyrisk ~s(Animal_id,bs="re") + s(DEM_IA), data = dat2, family = "binomial",discrete=TRUE)
topred <-  data.frame(DEM_IA = c(280,320))
predict(mod, newdata = topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE)

But this throws an error:

Error in eval(predvars, data, env) : object 'Animal_id' not found

Why would it need Animal_id when I am specifically telling it to exclude this term from the prediction? This is also especially strange as I can run the similar examples in the ?random.effects mgcv help file, no problem, even if I modify those examples to use bam() instead of gam()! Any help would be greatly appreciated!

EDIT

I may have found a fix; apparently if using discrete=TRUE in the bam() model, then predict.bam() also uses discrete=TRUE which will not work with missing random effect, but this works:

mod<- bam(Anyrisk ~s(Animal_id,bs="re") + s(DEM_IA), data = dat2, family = "binomial",discrete=TRUE)
topred <-  data.frame(DEM_IA = c(280,320))
predict(mod,topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE,discrete=FALSE)

Output:

         1          2 
-0.4451066 -0.0285989 
1
mgcv needs to build all the model terms by evaluating the bases before it can generate predictions. It would make the code more complex if at every evaluation of the bases you had to check whether you were including or excluding those terms etc, when all that is needed to ensure that is to zero out groups of columns from the Xp matrix immediately prior to computing the predicted values. Hence you always have to provide something for all terms used in the model, even though mgcv will later exclude the term; in your case just pass any one value of the possible levels for the ranef.Gavin Simpson
I would really recommend you don't use newdata.guaraneteed unless you are absolutely certain you have created newdata correctly. While you can omit a term if it's not needed because of exclude, and Simon suggests this as one way to ignore ranefs, as you've found this doesn't work all the time with every model. Providing something for all terms in the model will work everywhere.Gavin Simpson
I discovered that if discrete=TRUE is used in the predict() function, random effects are not ignored even if exclude = is specified. Note: > topred <- data.frame(DEM_IA = c(280,280),Animal_id = dat2$Animal_id[c(1,3)]) > predict(mod,topred, exclude="s(Animal_id)",discrete=TRUE) 1 2 -0.8157566 -0.7150832 > predict(mod,topred, exclude="s(Animal_id)",discrete=FALSE) 1 2 -0.4451066 -0.4451066 Silas Bergen
That was supposed to have been fixed in 1.8-32; check you are running version 1.8-33 of mgcvGavin Simpson
I'm using 1.8-31! I'll update; thanks.Silas Bergen

1 Answers

3
votes

tl;dr work around this by putting something in for Animal_id, it doesn't matter what value you specify (not NA though ...)

Why? Can't say for sure without more digging in the code, but ... it's often convenient to use model.frame(formula, newdata) as a step toward computing the required model matrix. (For example, one could proceed by constructing the whole model matrix, then zeroing out columns to be ignored ...) Figuring out which terms can be deleted from the formula may be a separate, more difficult step. (I don't know why it works differently in bam and gam though ...)

This appears to work fine:

topred <-  data.frame(DEM_IA = c(280,320),
                      Animal_id=dat2$Animal_id[1])
predict(mod, newdata = topred, exclude="s(Animal_id)",newdata.guaranteed = TRUE)

Check that it really doesn't matter what you specify for Animal_id:

res <- lapply(levels(dat2$Animal_id),
           function(i) {
             dd <- transform(topred, Animal_id=i)
               predict(mod, newdata = dd, 
                       exclude="s(Animal_id)",newdata.guaranteed = TRUE)
           })
do.call(rbind,res)

Results:

              1          2
[1,] -0.4451066 -0.0285989
[2,] -0.4451066 -0.0285989
[3,] -0.4451066 -0.0285989
[4,] -0.4451066 -0.0285989
[5,] -0.4451066 -0.0285989
[6,] -0.4451066 -0.0285989