As pointed out by Carl Witthoft, the standard parallelization tools in R use a shared memory model, so they will make things worse rather than better (their main purpose is to accelerate compute-bound jobs by using multiple processors).
In the short term, you might be able to save some memory by treating the categorical fixed-effect predictors (age, atc) as random effects but forcing their variances to be large. I don't know if this will be enough to save you or not; it will compress the fixed-effect model matrix a lot, but the model frame will still be stored/replicated with the model object ...
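As a rough, self-contained illustration of the difference (the 26-level factor here is invented, not taken from the original data): a factor entering the fixed effects is expanded into dense indicator columns of doubles, while the random-effects machinery stores the same information as a sparse matrix.
## illustration only: compare dense vs sparse storage of a many-level factor
library(Matrix)
set.seed(1)
f <- factor(sample(LETTERS, 1e5, replace=TRUE))  ## invented 26-level factor
X_dense  <- model.matrix(~f)              ## dense: intercept + 25 indicator columns
Z_sparse <- t(Matrix::fac2sparse(f))      ## sparse indicator matrix, as used for random effects
print(object.size(X_dense),  units="Mb")
print(object.size(Z_sparse), units="Mb")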
## stringsAsFactors=TRUE is needed in R >= 4.0 so that levels() can be used below
dd1 <- read.table(header=TRUE, stringsAsFactors=TRUE,
                  text="lho sex tc age atc patient
18 M 16.61 45-54 H 628143
7 F 10.52 12-15 G 2013855
30 M 92.73 35-44 N 2657693
19 M 24.92 70-74 G 2420965
12 F 17.44 65-69 A 2833610
31 F 7.03 75_and_over A 1090322
3 F 28.59 70-74 A 2718649
29 F 4.09 75_and_over C 384578
16 F 67.22 65-69 R 1579355
23 F 7.7 70-74 C 896374")
## simulate a larger data set with the same structure
n <- 1e5
set.seed(101)
dd2 <- with(dd1,
            data.frame(tc=rnorm(n, mean=mean(tc), sd=sd(tc)),
                       lho=round(runif(n, min=min(lho), max=max(lho))),
                       sex=sample(levels(sex), size=n, replace=TRUE),
                       age=sample(levels(age), size=n, replace=TRUE),
                       atc=sample(levels(atc), size=n, replace=TRUE),
                       patient=sample(1:1000, size=n, replace=TRUE)))
library("lme4")
m1 <- lmer(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
           data=dd2, REML=TRUE)
Random effects are automatically sorted in order from largest to smallest number of levels. Following the machinery described in the ?modular help page:
lmod <- lFormula(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
                 data=dd2, REML=TRUE)
names(lmod$reTrms$cnms)                  ## check the order of the random-effects terms
devfun <- do.call(mkLmerDevfun, lmod)    ## REML criterion as a function of theta
## optimize only the first element of theta (the term with the most levels,
## i.e. patient); hold the other three relative standard deviations fixed at
## a large value
wrapfun <- function(tt, bigsd=1000) {
    devfun(c(tt, rep(bigsd, 3)))
}
wrapfun(1)   ## test evaluation
opt <- optim(fn=wrapfun, par=1, method="Brent", lower=0, upper=1000)
opt$fval <- opt$value   ## mkMerMod() expects the objective value in 'fval'
res <- mkMerMod(environment(devfun), opt, lmod$reTrms, fr=lmod$fr)
res
You can ignore the reported variances for the categorical terms, and use ranef() to recover their (unshrunk) estimates.
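For example (a minimal sketch, using the res object fitted above):
fixef(res)       ## fixed effects: intercept and sex
ranef(res)$age   ## conditional modes for age (effectively unshrunk, given the forced large variance)
ranef(res)$atc   ## likewise for atc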
In the long term, the proper way to do this problem is probably to parallelize it with a distributed-memory model. In other words, you would want to parcel the data out in chunks to different servers; use the machinery described in ?modular to set up a likelihood function (actually a REML-criterion function) that gives the REML criterion for a subset of the data as a function of the parameters; and then run a central optimizer that takes a set of parameters, submits them to each server, retrieves the criterion value from each server, and adds them up. The only two problems I see with implementing this are (1) I don't actually know how to implement distributed-memory computation in R (based on this intro document, it seems that the Rmpi/doMPI packages might be the right way to go); and (2) in the default way that lmer is implemented, the fixed-effect parameters are profiled out rather than being an explicit part of the parameter vector.
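Here is a very rough, untested sketch of what that structure could look like, using socket clusters from the parallel package (which can span machines) rather than Rmpi/doMPI. The names chunks, make_chunk_devfun, and combined_crit are invented, the host names are placeholders, and simply adding per-chunk REML criteria is the approximation described above; note also that each chunk profiles out its own fixed effects, which is not quite the same as fitting a common set of fixed effects (problem (2) again).
## Sketch only. Assumes 'chunks' is a list of data frames, one per server;
## host names are placeholders.
library(parallel)
library(lme4)

cl <- makePSOCKcluster(c("server1", "server2"))   ## socket cluster across machines
clusterEvalQ(cl, library(lme4))

## build, on each worker, a REML-criterion function for its own chunk of data
make_chunk_devfun <- function(dat) {
    lmod <- lFormula(tc ~ sex + (1|lho) + (1|age) + (1|atc) + (1|patient),
                     data = dat, REML = TRUE)
    devfun <<- do.call(mkLmerDevfun, lmod)   ## store in the worker's global env
    NULL
}
clusterApply(cl, chunks, make_chunk_devfun)   ## one chunk per worker

## central objective: send theta to every server, retrieve and add the values
combined_crit <- function(theta) {
    vals <- clusterCall(cl, function(th) get("devfun", envir = .GlobalEnv)(th),
                        theta)
    Reduce(`+`, vals)
}

opt <- optim(par = rep(1, 4), fn = combined_crit,   ## four scalar random-effect SDs
             method = "L-BFGS-B", lower = 0)
stopCluster(cl)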
Comments:
[…] MixedModels package/library for Julia? – Ben Bolker
[…] parallel and snow and bigdata. Otherwise, even if someone writes a tool for you, you won't understand how to mod it or to find speed bottlenecks. – Carl Witthoft
[…] atc level? Could you model atc as a random effect? Or possibly agglomerate atc levels? Also, have you checked that your numeric variables are numeric in R? Anyway, parallelization will not help with the memory problem. In contrast, since each CPU will need RAM for its task, parallelization will exacerbate the problem. – Roland