1
votes

I am currently modelling gene expression over time using the BAM function from the MGCV package. There are certain genes which take a very long time to model, so I decided to try setting the discrete option to True. However, now the summaries of the outputs appear to vary massively between models. This is especially evident in the significance of the random effects terms and the total deviance explained by the model.

Is setting discrete to true somehow impacting the modelling of the random effects?

The models were fit as seen here:

 bam(value ~ oGeno +
      s(age, bs = "gp", k = 8, m=2) +
      s(age, by = oGeno, bs ="gp", k = 8, m=1) +
      s(age, sex, bs = "re") +
      s(litter, bs = "re") +
      s(hash, bs = "re") +
      s(hpool, bs = "re"),
    data = data, family = family, method="REML"

 bam(value ~ oGeno +
      s(age, bs = "gp", k = 8, m=2) +
      s(age, by = oGeno, bs ="gp", k = 8, m=1) +
      s(age, sex, bs = "re") +
      s(litter, bs = "re") +
      s(hash, bs = "re") +
      s(hpool, bs = "re"),
    data = data, family = family, discrete = T, method="fREML"

and generate these summaries

When discrete is false:

Family: Tweedie(p=1.32) 
Link function: log 

Formula:
value ~ oGeno + s(age, bs = basis, k = k, m = 2) + s(age, by = oGeno, 
    bs = basis, k = k, m = 1) + s(age, sex, bs = "re") + s(litter, 
    bs = "re") + s(hash, bs = "re") + s(hpool, bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.78886    0.05635  31.744   <2e-16 ***
oGeno.L     -0.05029    0.04337  -1.159    0.246    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                      edf  Ref.df       F  p-value    
s(age)           4.813266   4.864  48.547  < 2e-16 ***
s(age):oGenomut  3.100540   3.188   7.298 7.00e-05 ***
s(age,sex)       0.005974   1.000   0.039    0.188    
s(litter)       13.606287  89.000   5.819    0.199    
s(hash)         82.078473 148.000  38.136 8.22e-06 ***
s(hpool)        24.505710  28.000 654.942  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.259   Deviance explained = 33.5%
-REML =  27952  Scale est. = 1.9063    n = 15421

and when discrete is True:

Family: Tweedie(p=1.32) 
Link function: log 

Formula:
value ~ oGeno + s(age, bs = basis, k = k, m = 2) + s(age, by = oGeno, 
    bs = basis, k = k, m = 1) + s(age, sex, bs = "re") + s(litter, 
    bs = "re") + s(hash, bs = "re") + s(hpool, bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.77812    0.05850  30.393   <2e-16 ***
oGeno.L     -0.07017    0.04388  -1.599     0.11    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                    edf  Ref.df       F p-value    
s(age)           4.8187   4.870  31.101 < 2e-16 ***
s(age):oGenomut  3.1031   3.192   4.819 0.00108 ** 
s(age,sex)       0.3756   1.000 122.771 0.98174    
s(litter)       13.3135  93.000  68.682 1.00000    
s(hash)         81.8371 152.000 140.906 1.00000    
s(hpool)        24.5326  29.000 698.652 1.00000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.259   Deviance explained = 27.6%
fREML =  28369  Scale est. = 1.9063    n = 15421
1

1 Answers

0
votes

What happens if you use the default method = "fREML" in the first model fit, without discrete? The model fits don't look very different in terms of the EDFs, so I suspect there is some rank deficiency in the fits that "REML" can handle but either "fREML" or the discrete = TRUE part can't. Hence check with the bam() for the fit without discretisation using method = "fREML".

Because you are using gp smooths, the m bit isn't doing what you think it is; these smooths don't have derivative based penalties. Instead, m = 1 is the default, fitting a GP with a spherical covariance function. As such, I suspect the global plus subject specific smooths are highly correlated and that is causing fitting problems for one of the algorithms.

I don't see any reason why you would want a GP here. If speed is the concern, use bs = "cr" for simple cubic regression splines. and as those have derivative-based penalties, m = 1 should help to avoid the issues with fitting multiple smooths of age.