
I am trying to fit an additive mixed model using bam (mgcv package). My dataset has 10^6 observations from a longitudinal study on growth in 2×10^5 children nested in 300 health centers. I am looking for the slope for each center. The model is

bam(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + center + year + year*center + s(child, bs = "re"), data = data)

However, when I try to fit the model, the following error message appears:

Error: cannot allocate vector of size 99.6 Gb
In addition: Warning message:
In matrix(by, n, q) : data length exceeds size of matrix

I am working on a cluster with 500 GB of RAM.

Thank you for any help

Solutions to this are either very general (get more RAM) or very, very specific to your particular modeling task. e.g. see stackoverflow.com/q/10917532/324364 (comment by joran)

1 Answer


To diagnose more precisely where the problem is, try fitting your model with various terms left out. There are several terms in the model that could blow up on you:

  • the fixed effects involving center will blow up to 300 columns * 10^6 rows; depending on whether year is numeric or a factor, the year*center term could blow up to 600 columns or (nyears*300) columns
  • it's not clear to me whether bam uses sparse matrices for s(.,bs="re") terms; if not, you'll be in big trouble (2*10^5 columns * 10^6 rows)
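The column counts quoted for the year*center term can be checked on a small stand-in data set before touching the real one. The sketch below uses base R's model.matrix on simulated data; the data frame d, the columns year_num and year_fac, and the choice of 5 years are hypothetical and only illustrate the counting, not how bam builds its matrices internally.

```r
# Count fixed-effect columns for year*center on a small simulated data set.
# Everything here (d, year_num, year_fac, 5 years) is made up for illustration.
set.seed(1)
n_center <- 300
years <- 2001:2005

d <- data.frame(
  center   = factor(sample(n_center, 5000, replace = TRUE), levels = 1:n_center),
  year_num = sample(years, 5000, replace = TRUE)
)
d$year_fac <- factor(d$year_num, levels = years)

# Numeric year: intercept + year + 299 center dummies + 299 interactions
cols_numeric <- ncol(model.matrix(~ year_num * center, data = d))

# Factor year: intercept + 4 year dummies + 299 center dummies + 4 * 299 interactions
cols_factor <- ncol(model.matrix(~ year_fac * center, data = d))

cols_numeric  # 600
cols_factor   # 1500, i.e. nyears * 300
```

Multiplying these column counts by the 10^6 rows of the real data gives the size of the dense fixed-effects part of the model matrix.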

Order of magnitude, a vector of 10^6 numeric values (one column of your model matrix, at 8 bytes per value) takes 7.6 MB, so 500 GB / 7.6 MB would be approximately 65,000 columns ...
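The same back-of-the-envelope arithmetic can be written out in a few lines of base R. This is only a rough sketch: it assumes a fully dense model matrix of doubles and ignores the extra copies R makes during fitting, so real usage would be higher. The helper dense_gib is a hypothetical name introduced for this example.

```r
# Rough memory estimate for a dense model matrix of doubles (8 bytes each).
n_obs     <- 1e6   # observations in the data set
ram_gib   <- 500   # RAM available on the cluster
col_bytes <- n_obs * 8

# Size of one column in MiB (about 7.6)
col_mib <- col_bytes / 1024^2

# Number of dense columns that would fill the RAM completely
max_cols <- ram_gib * 1024^3 / col_bytes

# Hypothetical helper: size in GiB of a dense n_rows x n_cols matrix
dense_gib <- function(n_rows, n_cols) n_rows * n_cols * 8 / 1024^3

center_gib <- dense_gib(n_obs, 300)   # fixed effects for 300 centers
child_gib  <- dense_gib(n_obs, 2e5)   # one dense column per child

round(c(col_mib = col_mib, max_cols = max_cols,
        center_gib = center_gib, child_gib = child_gib), 1)
```

Under these assumptions the 300 center columns need only a couple of GiB, while a dense column per child would need well over 1,000 GiB, which is why the s(child, bs="re") term is the first suspect.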

Just taking a guess here, but I would try out the gamm4 package. It's not specifically geared for low-memory use, but its documentation says:

‘gamm4’ is most useful when the random effects are not i.i.d., or when there are large numbers of random coeffecients [sic] (more than several hundred), each applying to only a small proportion of the response data.

I would also make most of the terms into random effects (gamm4 takes these in lme4 notation through its random argument):

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
  random = ~ (1|center) + (1|year) + (1|year:center) + (1|child),
  data = data)

or, if there are not very many years in the data set, treat year as a fixed effect:

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + year,
  random = ~ (1|center) + (1|year:center) + (1|child),
  data = data)

If there are a small number of years then (year|center) might make sense, to assess among-center variation and covariation among years ... if there are many years, consider making it a smooth term instead ...