5
votes

Background: Multi-model inference with glmulti

glmulti is a R function/package for automated model selection for general linear models that constructs all possible general linear models given a dependent variable and a set of predictors, fits them via the classic glm function and allows then for multi-model inference (e.g., using model weights derived from AICc, BIC). glmulti works in theory also with any other function that returns coefficients, the log-likelihood of the model and the number of free parameters (and maybe other information?) in the same format that glm does.

My goal: Multi-model inference with robust errors

I would like to use glmulti with robust modeling of the errors of a quantitative dependent variable to guard against the effect out outliers.

For example, I could assume that the errors in the linear model are distributed as a t distribution instead of as a normal distribution. With its kurtosis parameter the t distribution can have heavy tails and is thus more robust to outliers (as compared to the normal distribution).

However, I'm not committed to using the t distribution approach. I'm happy with any approach that gives back a log-likelihood and thus works with the multimodel approach in glmulti. But that means, that unfortunately I cannot use the well-known robust linear models in R (e.g., lmRob from robust or lmrob from robustbase) because they do not operate under the log-likelihood framework and thus cannot work with glmulti.

The problem: I can't find a robust regression function that works with glmulti

The only robust linear regression function for R I found that operates under the log-likelihood framework is heavyLm (from the heavy package); it models the errors with a t distribution. Unfortunately, heavyLm does not work with glmulti (at least not out of the box) because it has no S3 method for loglik (and possibly other things).

To illustrate:

library(glmulti)
library(heavy)

Using the dataset stackloss

head(stackloss)

Regular Gaussian linear model:

summary(glm(stack.loss ~ ., data = stackloss))

Multi-model inference with glmulti using glm's default Gaussian link function

stackloss.glmulti <- glmulti(stack.loss ~ ., data = stackloss, level=1, crit=bic)
print(stackloss.glmulti)
plot(stackloss.glmulti)

Linear model with t distributed error (default is df=4)

summary(heavyLm(stack.loss ~ ., data = stackloss))

Multi-model inference with glmulti calling heavyLm as the fitting function

stackloss.heavyLm.glmulti <- glmulti(stack.loss ~ .,
 data = stackloss, level=1, crit=bic, fitfunction=heavyLm)

gives the following error:

    Initialization...
    Error in UseMethod("logLik") :
    no applicable method for 'logLik' applied to an object of class "heavyLm".

If I define the following function,

logLik.heavyLm <- function(x){x$logLik}

glmulti can get the log-likelihood, but then the next error occurs:

    Initialization...
    Error in .jcall(molly, "V", "supplyErrorDF", 
    as.integer(attr(logLik(fitfunc(as.formula(paste(y,  : 
    method supplyErrorDF with signature ([I)V not found

The question: Which function/package for robust linear regression works with glmulti (i.e., behaves like glm)?

There is probably a way to define further functions to get heavyLm working with glmulti, but before embarking on this journey I wanted to ask whether anybody

  • knows of a robust linear regression function that (a) operates under the log-likelihood framework and (b) behaves like glm (and will thus work with glmulti out-of-the-box).
  • got heavyLm already working with glmulti.

Any help is very much appreciated!

1

1 Answers

1
votes

Here is an answer using heavyLm. Even though this is a relatively old question, the same problem that you mentioned still occurs when using heavyLm (i.e., the error message Error in .jcall(molly, "V", "supplyErrorDF"…).

The problem is that glmulti requires the degrees of freedom of the model, to be passed as an attribute of you need to provide as an attribute of the value returned by function logLik.heavyLm; see the documentation for the function logLik for details. Moreover, it turns out that you also need to provide a function to return the number of data points that were used for fitting the model, since the information criteria (AIC, BIC, …) depend on this value too. This is done by function nobs.heavyLm in the code below.

Here is the code:

nobs.heavyLm <- function(mdl) mdl$dims[1] # the sample size (number of data points)

logLik.heavyLm <- function(mdl) {
    res <- mdl$logLik
    attr(res, "nobs") <- nobs.heavyLm(mdl) # this is not really needed for 'glmulti', but is included to adhere to the format of 'logLik'
    attr(res, "df") <- length(mdl$coefficients) + 1 + 1 # I am also considering the scale parameter that is estimated; see mdl$family
    class(res) <- "logLik"
    res
}

which, when put together with the code that you provided, produces the following result:

Initialization...
TASK: Exhaustive screening of candidate set.
Fitting...
Completed.

> print(stackloss.glmulti)
glmulti.analysis
Method: h / Fitting: glm / IC used: bic
Level: 1 / Marginality: FALSE
From 8 models:
Best IC: 117.892471265874
Best model:
[1] "stack.loss ~ 1 + Air.Flow + Water.Temp"
Evidence weight: 0.709174196998897
Worst IC: 162.083142797858
2 models within 2 IC units.
1 models to reach 95% of evidence weight.

producing therefore 2 models within the 2 BIC units threshold.

An important remark though: I am not sure that the expression above for the degrees of freedom is strictly correct. For a standard linear model, the degrees of freedom would be equal to p + 1, where p is the number of parameters in the model, and the extra parameter (the + 1) is the "error" variance (which is used to calculate the likelihood). In function logLik.heavyLm above, it is not clear to me whether one should also count the "scale parameter" that is estimated by heavyLm as an extra degree of freedom, and hence the p + 1 + 1, which would be the case if the likelihood is also a function of this parameter. Unfortunately, I cannot confirm this, since I don’t have access to the reference that heavyLm cites (the paper by Dempster et al., 1980). Because of this, I am counting the scale parameter, thereby providing a (slightly more) conservative estimate of model complexity, penalizing "complex" models. This difference should be negligible, except in the small sample case.