Imputation using mice with clustered data

Question

I am using the mice package to impute missing data. I'm new to imputation, so I have made some progress but have hit a steep learning curve. Here is a toy example:

library(mice)
# Using nhanes dataset as example
df1 <- mice(nhanes, m=10)

As you can see, I imputed nhanes 10 times using mostly default settings (the result is stored in df1), and I am comfortable using this result in regression models, pooling results, etc. However, my real data are survey data from different countries. The amount of missing data differs by country, as do the values of specific variables, e.g. age, education level, etc. Therefore I would like to impute the missing values while allowing for clustering by country. So I will create a grouping variable that has no missing values (of course, in this toy example it is unrelated to the other variables, but in my real data the correlations exist):

# Create a grouping variable
nhanes$country <- sample(c("A", "B"), size=nrow(nhanes), replace=TRUE)

So how do I tell mice() that this variable is different from the others, i.e. that it is a level in a multilevel dataset?

Would running mice on each factor level be a good workaround? For example, mice(nhanes[which(nhanes$country == 'A'),], m=10), and then loop over the factor levels or use your favorite groupby operation in R? This of course assumes that to impute data for country A, one doesn't need the other countries, i.e. that they're independent. — Gene Burinsky
Yes, I did try this, and there is a function to combine the datasets, rbind.mids(), but I found that this function gives me lots of warnings and errors that I could not figure out. Ultimately I thought that imputing with recognition of the data structure would be better. Thanks for the suggestion. — user2498193

SimonG · Accepted Answer · 2016-06-29T21:18:12

If you're thinking of clusters as in "mixed-effects" models, then you should use the methods in mice that are intended for clustered data. These methods are listed in the manual and usually carry a 2l prefix, as in 2l.something.

The variety of methods for clustered data is somewhat limited in mice, but I can recommend using 2l.pan for missing data in lower-level units and 2l.only.norm at the cluster level.
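To see which two-level methods your installed version of mice provides, one option is to list the exported imputation functions, which follow the naming pattern mice.impute.<method>. This is only a small sketch: the variable names are mine, and the list you get depends on your mice version.

```r
library(mice)

# Imputation functions exported by mice are named mice.impute.<method>;
# keep only those whose method name starts with "2l"
impute_funs <- ls("package:mice", pattern = "^mice\\.impute\\.2l")

# strip the prefix to get the names that go into the 'method' argument
two_level_methods <- sub("^mice\\.impute\\.", "", impute_funs)
print(two_level_methods)
```

The names printed here are the strings you would assign in the method vector, as done with "2l.pan" in Case 1.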

As an alternative to mixed-effects models, you may consider using dummy indicators to represent the cluster structure (i.e., one dummy variable for each cluster). This method is not ideal when you think of the clusters from the perspective of mixed-effects models. So if you want to do mixed-effects analyses, then stick to mixed-effects models when you can.

Below, I show an example for both strategies.

Preparation:

library(mice)
data(nhanes)

set.seed(123)
nhanes <- within(nhanes,{
  country <- factor(sample(LETTERS[1:10], size=nrow(nhanes), replace=TRUE))
  countryID <- as.integer(country)  # integer codes of the country factor, used as cluster ID
})

Case 1: Imputation using mixed-effects models

This section uses 2l.pan to impute the three variables with missing data. Note that I use countryID as the cluster variable by specifying a -2 in the predictor matrix. To all other variables, I assign fixed effects only (1).

# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred1 <- imp0$predictorMatrix
meth1 <- imp0$method

# set imputation procedures
meth1[c("bmi","hyp","chl")] <- "2l.pan"

# set predictor Matrix (mixed-effects models with random intercept
# for countryID and fixed effects otherwise)
pred1[,"country"] <- 0     # don't use country factor
pred1[,"countryID"] <- -2  # use countryID as cluster variable
pred1["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred1["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred1["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp1 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred1, method=meth1)
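Before starting a long run, it can be worth checking that each imputed variable has exactly one cluster variable (-2) and the intended fixed effects (1) in its row of the predictor matrix. The sketch below rebuilds the Case 1 setup in a self-contained way and only inspects the matrix (maxit=0), so nothing is imputed. The names dat and vars are made up for illustration.

```r
library(mice)

set.seed(123)
dat <- nhanes
dat$countryID <- as.integer(factor(sample(LETTERS[1:10], size=nrow(dat), replace=TRUE)))

# "empty" imputation: only sets up the default predictor matrix
pred <- mice(dat, maxit=0)$predictorMatrix

vars <- c("bmi","hyp","chl")    # variables with missing data
pred[vars, "countryID"] <- -2   # countryID as cluster variable

# rows = variables being imputed, columns = their predictors
print(pred[vars, ])

# each imputed variable should have exactly one cluster variable ...
stopifnot(all(rowSums(pred[vars, ] == -2) == 1))
# ... and should not be used as a predictor of itself
stopifnot(all(diag(pred[vars, vars]) == 0))
```

If either check fails, mice() would be called with a predictor matrix that does not describe the intended random-intercept model, so it is cheaper to catch this before imputing.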

Case 2: Imputation using dummy indicators (DIs) for clusters

This section uses pmm for imputation, and the clustered structure is represented in an "ad hoc" fashion. That is, the clusters aren't represented by random effects but by fixed effects instead. This may exaggerate the cluster-level variability of the variables with missing data, so be sure you know what you are doing when you use it.

# create dummy indicator variables
DIs <- with(nhanes, contrasts(country)[country,])
colnames(DIs) <- paste0("country",colnames(DIs))
nhanes <- cbind(nhanes,DIs)


# "empty" imputation as a template
imp0 <- mice(nhanes, maxit=0)
pred2 <- imp0$predictorMatrix
meth2 <- imp0$method

# set imputation procedures
meth2[c("bmi","hyp","chl")] <- "pmm"

# set predictor matrix (dummy indicators represent the clusters,
# fixed effects otherwise)
pred2[,"country"] <- 0     # don't use country factor
pred2[,"countryID"] <- 0   # don't use countryID
pred2[,colnames(DIs)] <- 1 # use dummy indicators
pred2["bmi", c("age","hyp","chl")] <- c(1,1,1)  # fixed effects (bmi)
pred2["hyp", c("age","bmi","chl")] <- c(1,1,1)  # fixed effects (hyp)
pred2["chl", c("age","bmi","hyp")] <- c(1,1,1)  # fixed effects (chl)

# impute
imp2 <- mice(nhanes, maxit=20, m=10, predictorMatrix=pred2, method=meth2)
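As a cross-check on the dummy indicators used in Case 2, the same coding can be produced with model.matrix() from base R. The sketch below is self-contained and assumes R's default treatment contrasts. It is an alternative way to construct the indicators, not part of the procedure above, and DIs2 is a name I made up.

```r
set.seed(123)
country <- factor(sample(LETTERS[1:10], size=25, replace=TRUE))

# construction used above: one row of the contrast matrix per observation
DIs <- contrasts(country)[country, ]
colnames(DIs) <- paste0("country", colnames(DIs))

# alternative: design matrix of the factor without the intercept column
DIs2 <- model.matrix(~ country)[, -1, drop=FALSE]

# both give (number of levels - 1) indicator columns with identical values
stopifnot(ncol(DIs) == nlevels(country) - 1)
stopifnot(all(DIs == DIs2))
stopifnot(identical(colnames(DIs), colnames(DIs2)))
```

In both versions the first level of country serves as the reference category, which is why there is one indicator fewer than there are clusters.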

If you want to read up on what to think of these methods, have a look at one or two of these papers.
