2
votes

I am looking for feedback on how to correctly specify random effects to account for correlation in a repeated-measures design with multiple levels of correlation (the data are also longitudinal within each combination of predictors). The outcome is binary, so I will be fitting a logistic mixed model, and I was planning to use the glmer() function from the lme4 package. If you're wondering how such data arise, one example is an eye tracker: a person's eyes are tracked for 30 seconds under each combination of the predictors, recording whether or not they looked at a certain object on the screen (hence the binary outcome).

Study design (which can be seen by running the code under "Dummy dataset" below in R):

  • The outcome (Binary_outcome) is binary.
    • There are repeated measures: each subject's binary response is recorded multiple times within each combination of predictors (see "Dummy dataset" below for the structure).
  • There are two predictors of interest (both binary, categorical):
    • One between-subjects factor, Sex (male/female).
    • One within-subjects factor, Intervention (pre/post).
  • Each subject is measured over six trials (within which there are repeated measures), Trial.
    • Note that there are 12 possible trials a subject could be assigned to. Thus, not every subject is in all 12 trials, but rather a random set of 6.
    • Trial is not a variable of interest. It is merely thought that observations within an individual, within a trial, could be more alike, so Trial should also be accounted for as a form of cluster correlation.

Dummy dataset: Shows the general structure of my data (although this is not the actual dataset). I assign it to data01 so it can be used with the code below:

data01 <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Trial = c("A", "A", 
"A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E", 
"F", "F", "F", "G", "G", "G", "E", "E", "E", "D", "D", "D", "A", 
"A", "A", "J", "J", "J", "L", "L", "L"), Intervention = c("Pre", "Pre", "Pre", "Pre", 
"Pre", "Pre", "Pre", "Pre", "Pre", "Post", "Post", "Post", "Post", 
"Post", "Post", "Post", "Post", "Post", "Pre", "Pre", "Pre", 
"Pre", "Pre", "Pre", "Pre", "Pre", "Pre", "Post", "Post", "Post", 
"Post", "Post", "Post", "Post", "Post", "Post"), Sex = c("Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Male", "Male", "Male", "Male", 
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", 
"Male", "Male", "Male", "Male", "Male", "Male"), Binary_outcome = c(1L, 
1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -36L))

Current code being used: This is what I'm using currently, but I do not know if I should be specifying the random effects differently given the structure of the data (see my questions under "Accounting correctly for correlation" below).

install.packages("lme4")
library(lme4)

logit_model <- glmer(Binary_outcome ~ factor(Sex)*factor(Intervention) + 
                                (1 | Trial) + 
                                (1 | Subject), 
                     data = data01, 
                     family="binomial")

Accounting correctly for correlation: This is where my question lies. Comments/Questions:

  • I believe both the Subject and Trial random effects are crossed (not nested), because Subject 1 is always Subject 1, and Trial A is always Trial A. There is no way to re-number/re-letter these as you could if the design were nested (see, e.g.: https://stats.stackexchange.com/questions/228800/crossed-vs-nested-random-effects-how-do-they-differ-and-how-are-they-specified).
  • As can be seen above under "Current code being used," I have included the fixed effects of interest (Sex, Intervention, and Sex*Intervention), and random intercepts for Trial and Subject using + (1 | Trial) + (1 | Subject).
    • Does + (1 | Trial) + (1 | Subject) correctly "tell" the model to account for the correlation within a person, within a trial, or does this need to be specified in another way? Even though I don't think the random effects are nested, it still feels like there's a "hierarchy," but maybe this is already accounted for by + (1 | Trial) + (1 | Subject).
    • These data seem unique in that, even within a trial, there are multiple measurements (0s/1s) for each subject. I am unsure of the implications of this with regard to the model fitting.
    • Do I need to further tell the model to differentiate the within- and between-subjects fixed effects? Or does the code "pick up" on this "automatically" with + (1 | Trial) + (1 | Subject)? The model correctly does this when you simply specify a random intercept for subject in lmer() with + (1 | Subject), or in aov() with + Error(Subject), for example. This is why I simply used + (1 | Trial) + (1 | Subject) here.
  • Lastly, I don't know if it matters that not every subject gets every trial (but rather 6 out of 12 possible trials), and if this affects some aspect of the code.

I am looking for your feedback, and preferably also the reference(s) (texts, peer-reviewed papers) used to determine your feedback. I have multiple texts on logistic regression, broader categorical data analysis, and mixed models, but, as far as I can tell, none of them bring together the ideas I have posed here. Thus, knowing of a resource that is particularly useful for this situation would also be helpful.

1/2 A paper that may help (although it's a linear, not logistic, mixed model) is Baayen et al. (2008), "Mixed-effects modeling with crossed random effects for subjects and items." They at least have items (trials) that are repeated, as well as a within-subjects variable (called SOA in the paper). There are a few differences though: no between-subjects variable, use of the same items/trials for all subjects (and across the two levels of SOA), and no repeated measures within an item/trial (which are inherent in my binary data, which, again, span some period of time for each individual). - Meg
2/2 However, this paper is showing me that the random-effects specification may be at least somewhat related to how much your model "needs" in order to be well estimated/parameterized. They start with more elaborate random effects, but ultimately end up with + (1 | Item) + (1 | Subj), which is equivalent to the + (1 | Trial) + (1 | Subject) in my post. - Meg
Another insight from Baayen et al. (2008): in their second example, not every subject gets the same words, but the random effects are again specified as + (1 | Word) + (1 | Subject), so this may indicate that the answer to my question about whether it matters that not all subjects have the same trials (in my case, images) is no. - Meg
This belongs on CrossValidated. I've voted to close/migrate it there. - Ben Bolker
Thank you, @Ben Bolker. I wasn't sure where best to post. I support the move to CrossValidated. - Meg

1 Answer

3
votes

(1|Trial) + (1|Subject) is reasonable: it specifies variation among trials, and among subjects. The effects are indeed crossed: if you only wanted to allow variation among trials within subjects you'd use (1|Subject/Trial); for variation among subjects within trials you'd use (1|Trial/Subject). Since you have multiple observations per trial:subject combination you could use (1|Trial) + (1|Subject) + (1|Subject:Trial) to allow for another level of variation, but I have an alternative suggestion (see below).
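
For concreteness, here is a sketch of those specifications in lme4 syntax (using data01, the example data from the question):

# crossed random intercepts (the original specification)
m1 <- glmer(Binary_outcome ~ Sex * Intervention +
              (1 | Trial) + (1 | Subject),
            data = data01, family = binomial)

# adding a Subject:Trial interaction intercept to capture extra
# variation specific to each subject-by-trial combination
m2 <- glmer(Binary_outcome ~ Sex * Intervention +
              (1 | Trial) + (1 | Subject) + (1 | Subject:Trial),
            data = data01, family = binomial)

# (1 | Subject/Trial) is shorthand for (1 | Subject) + (1 | Subject:Trial),
# i.e. the nested version without the crossed (1 | Trial) term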

I believe the maximal model corresponding to this design is

Binary_outcome ~ Sex*Intervention + cor(Trial | Subject) + (1|Trial)

Where cor() expresses a correlation matrix, i.e. we are not trying to estimate the variation across repeated measures within the same trial for each subject, because we don't have that information. Here (1|Trial) expresses the variation across trials that is common to all subjects, while cor(Trial|Subject) expresses the correlation across trials within subjects. However, while it's a useful exercise to try to identify what the maximal model would be, it's not practical here for two reasons: (1) estimating a full correlation matrix across trials would require (n*(n-1)/2 = 12*11/2 =) 66 parameters, which won't be possible without a giant data set and a giant computer; (2) few of the available mixed-model tools in R provide the flexibility to constrain a random effect to a correlation matrix (MCMCglmm does, and some of the other Bayesian tools such as brms might; glmmTMB could be extended fairly easily, and lme4 could be hacked ...)
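
To see the scale of the problem, here is what the closest unconstrained analogue of the maximal model would look like in lme4 syntax. This is a sketch only; I would not expect it to be estimable with any realistic data set, because (0 + Trial | Subject) implies a full 12x12 covariance matrix (12 variances plus 66 correlations):

# maximal-ish model: shown only to illustrate the size of the
# random-effects structure, not as a practical recommendation
m_max <- glmer(Binary_outcome ~ Sex * Intervention +
                 (1 | Trial) +            # among-trial variation common to all subjects
                 (0 + Trial | Subject),   # correlated trial effects within subjects
               data = data01, family = binomial)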

  • There is no need to code the "level" of fixed effects (within- vs. between-subjects) explicitly.
  • Lack of balance and/or lack of complete crossing will reduce your power for a given sample size, but is not otherwise a problem (this is one of the big advantages of mixed-model approaches).
  • It sounds like the multiple observations per subject:trial combination are exchangeable (i.e. you can treat them all as samples from the same distribution, with the same expected value, etc.; an exception would be if you wanted to take account of the order of observations within subject:trial, e.g. a trend in accuracy over time). In this case, you're better off aggregating and doing a binomial regression: treating each subject:trial cell as "m successes out of N trials" rather than "{1,0,1,1,1,0,0,1}" (see the sketch after this list).
  • For small effective sample sizes per cluster (i.e. if there is a fairly small number of total binary observations per subject), you need to be careful about some of the technical details: the accuracy of the widely used Laplace approximation (used by lme4, glmmTMB, INLA, ...) may be poor. Unfortunately, other than going Bayesian, you don't have a lot of options here; adaptive Gauss-Hermite quadrature (lme4, GLMMadaptive) is rarely implemented/available for problems with multiple random effects.
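
A sketch of the aggregation approach, assuming the repeated 0/1s within each Subject:Trial cell are exchangeable (again using data01 from the question):

# collapse repeated binary observations to successes/failures
# within each Subject x Trial x Sex x Intervention cell
agg <- aggregate(Binary_outcome ~ Subject + Trial + Sex + Intervention,
                 data = data01,
                 FUN = function(x) c(succ = sum(x), fail = sum(1 - x)))
agg <- do.call(data.frame, agg)  # flatten the matrix column

# binomial (rather than Bernoulli) mixed model on the aggregated counts
m_agg <- glmer(cbind(Binary_outcome.succ, Binary_outcome.fail) ~
                 Sex * Intervention + (1 | Trial) + (1 | Subject),
               data = agg, family = binomial)

# note: lme4's adaptive Gauss-Hermite quadrature (nAGQ > 1) only works
# for models with a single scalar random effect, so it isn't available here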