
I am looking for feedback on how to correctly specify random effects to account for correlation in a repeated-measures design with multiple levels of correlation (including data that are longitudinal within each combination of predictors). The outcome is binary, so I will be fitting a logistic mixed model, and I was planning to use the glmer() function from the lme4 package. If you're wondering how such data arise, one example is an eye tracker: each person's gaze is tracked for 30 seconds under different levels of the predictors, recording whether they looked at a certain object on the screen or not (hence the binary outcome).

Study design (which can be seen by running the code under "Dummy dataset" below in R):

  • The outcome (Binary_outcome) is binary.
    • There are repeated measures: each subject's binary response is recorded multiple times within each combination of predictors (see "Dummy dataset" below for structure).
  • There are two predictors of interest (both binary, categorical):
    • One between-subjects factor, Sex (male/female).
    • One within-subjects factor, Intervention (pre/post).
  • Each subject is measured over six trials (within each of which there are repeated measures); this is the Trial variable.
    • Note there are 12 possible trials a person could be assigned to. Thus, not every subject is in all 12 trials; each gets a random set of 6.
    • Trial is not a variable of interest. It is merely thought that observations within an individual within a trial could be more alike, so Trial should also be accounted for as a form of cluster correlation.

Dummy dataset: This shows the general structure of my data (although it is not the actual dataset):

data01 <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Trial = c("A", "A", 
"A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E", 
"F", "F", "F", "G", "G", "G", "E", "E", "E", "D", "D", "D", "A", 
"A", "A", "J", "J", "J", "L", "L", "L"), Intervention = c("Pre", "Pre", "Pre", "Pre", 
"Pre", "Pre", "Pre", "Pre", "Pre", "Post", "Post", "Post", "Post", 
"Post", "Post", "Post", "Post", "Post", "Pre", "Pre", "Pre", 
"Pre", "Pre", "Pre", "Pre", "Pre", "Pre", "Post", "Post", "Post", 
"Post", "Post", "Post", "Post", "Post", "Post"), Sex = c("Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Male", "Male", "Male", "Male", 
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", 
"Male", "Male", "Male", "Male", "Male", "Male"), Binary_outcome = c(1L, 
1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 
1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -36L))

Current code being used: This is what I'm using at the moment, but I do not know whether I should be specifying the random effects differently based on the structure of the data (outlined below under "Accounting correctly for correlation").

# install.packages("lme4")  # run once if lme4 is not yet installed
library(lme4)

# Crossed random intercepts for Trial and Subject
logit_model <- glmer(Binary_outcome ~ factor(Sex)*factor(Intervention) + 
                                (1 | Trial) + 
                                (1 | Subject), 
                     data = data01, 
                     family = binomial)
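After fitting, I can also inspect the estimates and check whether any variance component has collapsed to zero; isSingular() is part of lme4 (a quick sketch):

summary(logit_model)      # fixed effects and variance components
isSingular(logit_model)   # TRUE flags a degenerate random-effects fit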

Accounting correctly for correlation: This is where my question lies. Comments/Questions:

  • I believe both the Subject and Trial random effects are crossed (not nested), because Subject 1 is always Subject 1, and Trial A is always Trial A. There is no way to re-number/re-letter these as you could if the design were nested (see, e.g.: https://stats.stackexchange.com/questions/228800/crossed-vs-nested-random-effects-how-do-they-differ-and-how-are-they-specified). A quick cross-tabulation that makes this visible is sketched after this list.
  • As can be seen above under "Current code being used," I have included the fixed effects of interest (Sex, Intervention, and the Sex × Intervention interaction), and random intercepts for Trial and Subject using + (1 | Trial) + (1 | Subject).
    • Does + (1 | Trial) + (1 | Subject) correctly "tell" the model to account for the correlation within a person, within a trial, or does this need to be specified in another way? Even though I don't think the random effects are nested, it still feels like there's a "hierarchy," but maybe this is already accounted for by + (1 | Trial) + (1 | Subject).
    • These data seem unique in that, even within a trial, there are multiple measurements (0s/1s) for each subject. I am unsure of the implications of this with regard to the model fitting.
    • Do I need to further tell the model to differentiate the within- and between-subjects fixed effects? Or does the code "pick up" on this "automatically" with + (1 | Trial) + (1 | Subject)? It correctly does this when you simply specify a random intercept for subject, e.g., in lmer() with + (1 | Subject), or in aov() with + Error(Subject). This is why I simply used + (1 | Trial) + (1 | Subject) here.
  • Lastly, I don't know if it matters that not every subject gets every trial (but rather 6 out of 12 possible trials), and whether this affects some aspect of the code.
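To double-check the crossed (rather than nested) structure, a cross-tabulation of Subject against Trial works. A minimal sketch, assuming the dummy data above is assigned to data01:

# Counts of observations per Subject-Trial combination. Trials such as
# "A", "D", and "E" appear under more than one subject, so Subject and
# Trial are crossed rather than nested.
xtabs(~ Subject + Trial, data = data01)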

I am looking for your feedback, and preferably also the reference(s) (texts, peer-reviewed papers) used to determine your feedback. I have multiple texts on logistic regression, broader categorical data analysis, and mixed models, but, as far as I can tell, none of them bring together the ideas I have posed here. Thus, pointers to a resource that is particularly useful for this situation would also be helpful.

1/2 A paper that may help (although it's a linear, not logistic, mixed model) is Baayen et al. (2008), "Mixed-effects modeling with crossed random effects for subjects and items." They at least have items (trials) that are repeated, as well as a within-subjects variable (called SOA in the paper). There are a few differences, though: no between-subjects variable, use of the same items/trials for all subjects (and across the two levels of SOA), and no repeated measures within an item/trial (which are inherent in my binary data, which, again, span some period of time for each individual). – Meg
2/2 However, this paper is showing me that the random-effects specification may be at least somewhat related to how much your model "needs" in order to be well estimated/parameterized. They start with more elaborate random effects, but ultimately end up with + (1 | Item) + (1 | Subj), which is equivalent to the + (1 | Trial) + (1 | Subject) in my post. – Meg
Another insight from Baayen et al. (2008): in their second example, not every subject gets the same words, but the random effects are again specified as + (1 | Word) + (1 | Subject), so this may indicate that the answer to my question about whether it matters that not all subjects have the same trials (in my case, images) is no. – Meg
This belongs on CrossValidated; I've voted to close/migrate it there. – Ben Bolker
Thank you, @Ben Bolker. I wasn't sure where best to post. I support the move to CrossValidated. – Meg

1 Answer


(1|Trial) + (1|Subject) is reasonable: it specifies variation among trials and among subjects. The effects are indeed crossed: if you only wanted to allow variation among trials within subjects you'd use (1|Subject/Trial); for variation among subjects within trials you'd use (1|Trial/Subject). Since you have multiple observations per trial:subject combination, you could use (1|Trial) + (1|Subject) + (1|Subject:Trial) to allow for another level of variation, but I have an alternative suggestion (see below).
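As a syntax illustration of that last option (a sketch only, assuming the dummy data from the question is assigned to data01; with only 36 rows this will almost certainly give a singular fit):

library(lme4)

# Random intercepts for Trial, Subject, and the Subject-Trial combination;
# (1 | Subject:Trial) captures extra similarity among the repeated binary
# observations made within the same subject-trial pair.
model_st <- glmer(Binary_outcome ~ Sex*Intervention +
                      (1 | Trial) + (1 | Subject) + (1 | Subject:Trial),
                  data = data01, family = binomial)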

I believe the maximal model corresponding to this design is

Binary_outcome ~ Sex*Intervention + cor(Trial | Subject) + (1|Trial)

where cor() expresses a correlation matrix; i.e., we are not trying to estimate the variation across repeated measures within the same trial for each subject, because we don't have that information. Here (1|Trial) expresses the variation across trials that is common to all subjects, while cor(Trial|Subject) expresses the correlation across trials within subjects. However, while it's a useful exercise to try to identify what the maximal model would be, it's not practical here for two reasons: (1) estimating a full correlation matrix across trials would require (n*(n-1)/2 = 12*11/2 =) 66 parameters, which won't be possible without a giant data set and a giant computer; (2) few of the available mixed-model tools in R provide the flexibility to constrain a random effect to a correlation matrix (MCMCglmm does, and some of the other Bayesian tools such as brms might; glmmTMB could be extended fairly easily, and lme4 could be hacked ...)
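For concreteness, here is the closest unconstrained analogue in lme4 syntax (a sketch only; it estimates a full 12 × 12 covariance matrix, i.e., 12 variances plus the 66 correlations, rather than the constrained correlation matrix written above, which is exactly why it is impractical):

# (0 + Trial | Subject) gives each subject one random effect per trial
# level, with an unstructured covariance across the 12 levels: 78
# parameters in all. Shown only to illustrate the size of the problem.
max_model <- glmer(Binary_outcome ~ Sex*Intervention +
                       (0 + Trial | Subject) + (1 | Trial),
                   data = data01, family = binomial)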

  • There is no need to code the "level" of fixed effects (within- vs. between-) explicitly.
  • Lack of balance and/or lack of complete crossing will reduce your power for a given sample size, but is not otherwise a problem (this is one of the big advantages of mixed-model approaches).
  • It sounds like the multiple observations per subject:trial combination are exchangeable (i.e., you can treat them all as samples from the same distribution, with the same expected value, etc.; an exception would be if you wanted to take account of the order of observations within subject:trial, e.g., a trend in accuracy over time). In this case, you're better off aggregating and doing a binomial regression, treating a subject as "m successes out of N observations" rather than "{1,0,1,1,1,0,0,1}" (see the sketch after this list).
  • For small effective sample sizes per cluster (i.e., if there are a fairly small number of total binary observations per subject), you need to be careful about some of the technical details: the accuracy of the widely used Laplace approximation (used by lme4, glmmTMB, INLA, ...) may be poor. Unfortunately, other than going Bayesian, you don't have a lot of options here; adaptive Gauss-Hermite quadrature (lme4, GLMMadaptive) is rarely implemented/available for problems with multiple random effects.
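Following up on the aggregation point above, a minimal sketch (again assuming the dummy data is stored in data01; the grouping variables are taken from the structure shown in the question):

# Collapse the repeated 0/1 observations within each subject-trial cell
# into "successes out of total" counts.
agg <- aggregate(cbind(successes = Binary_outcome, total = 1) ~
                     Subject + Trial + Intervention + Sex,
                 data = data01, FUN = sum)

# Binomial (rather than Bernoulli) mixed model on the aggregated data;
# equivalent to the disaggregated fit when observations are exchangeable.
agg_model <- glmer(cbind(successes, total - successes) ~
                       Sex*Intervention + (1 | Trial) + (1 | Subject),
                   data = agg, family = binomial)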