
I have been performing multiple imputation using the mice package (van Buuren) in R, with m = 50 (50 imputed datasets) and 20 iterations, for roughly 9 variables with missing data (assumed MAR, missing at random) ranging from 5-13 %. After this, I want to estimate descriptive statistics for my dataset (i.e. not rely on complete case analysis alone for the descriptive statistics, but also compare those results with the descriptive statistics from the imputation). My question is how to proceed.

I know that the correct procedure for dealing with MICE-data is:

  1. Impute the missing data with the mice() function, resulting in a multiply imputed data set (class mids);
  2. Fit the model of interest (the scientific model) on each imputed data set with the with() function, resulting in an object of class mira;
  3. Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class mipo; optionally, compare pooled estimates from different scientific models with the D1() or D3() functions. (I sketch steps 2-3 below with the built-in nhanes data.)
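
For completeness, this is my understanding of steps 2 and 3 in code, using the nhanes example data that ships with mice and a placeholder model (not my actual analysis):

library(mice)

# Step 1: impute, giving a mids object
imp <- mice(nhanes, m = 5, seed = 123)

# Step 2: fit the scientific model on each imputed dataset, giving a mira object
fit <- with(imp, lm(bmi ~ age + chl))

# Step 3: pool estimates and standard errors with Rubin's rules, giving a mipo object
pooled <- pool(fit)
summary(pooled)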

My problem is that I do not understand how to apply this theory to my data. So far I have done:

# Load packages:
library(mice)
library(dplyr)

# Perform imputation (predictive mean matching, 50 imputations, 20 iterations):
Imp_Data <- mice(MY_DATA, m = 50, method = "pmm", maxit = 20, seed = 123)

# Stack the imputed datasets in long format (adds .imp and .id columns):
Imp_Data_Long <- complete(Imp_Data, action = "long", include = FALSE)

I then assumed that the following is the correct procedure for getting the median of the BMI variable within each smoking group, where .imp is the number of the imputed dataset (i.e. 1-50):

BMI_Medians_50 <- Imp_Data_Long %>% group_by(.imp, Smoker) %>% summarise(Med_BMI = median(BMI))

BMI_Median_Pooled <- mean(BMI_Medians_50$Med_BMI)

I might have misunderstood things completely, but I have tried hard to find the right procedure and have come across very different approaches here on StackOverflow and on StatQuest.


1 Answer


Well, in general you perform multiple imputation (as opposed to single imputation) because you want to account for the uncertainty that comes with imputing.

The missing data is ultimately lost; we can only estimate what the real data might have looked like. With multiple imputation we generate several such estimates, so that we end up with something like a probability distribution for each imputed value.

For your descriptive statistics you do not need pooling with Rubin's rules (those matter mainly for standard errors and other quantities from regression models). Instead, calculate your statistics on each of your m = 50 imputed datasets separately and then summarise them across datasets with the metrics you are interested in, as sketched below.
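
With the long-format data from your question (Imp_Data_Long with its .imp column, and dplyr loaded), a minimal sketch for the median of BMI could look like this:

# Median BMI within each of the m = 50 imputed datasets
BMI_Medians_50 <- Imp_Data_Long %>%
  group_by(.imp) %>%
  summarise(Med_BMI = median(BMI))

# A simple pooled point estimate: the average of the 50 per-dataset medians
BMI_Median_Pooled <- mean(BMI_Medians_50$Med_BMI)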

What you want to achieve is to give your reader information about the uncertainty that comes with the imputation (and an estimate of the bounds within which the imputed values most likely lie).

Take the mean as an example of a descriptive statistic. Here you could, for example, report the lowest and the highest mean across the imputed datasets, together with the mean of these means and the standard deviation of the means over the imputed datasets, as in the sketch below.
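
Continuing the sketch above with the BMI variable from your question:

# Mean BMI within each imputed dataset
BMI_Means_50 <- Imp_Data_Long %>%
  group_by(.imp) %>%
  summarise(Mean_BMI = mean(BMI))

# Range, centre and spread of the 50 per-dataset means
BMI_Means_50 %>%
  summarise(
    Lowest_Mean   = min(Mean_BMI),
    Highest_Mean  = max(Mean_BMI),
    Mean_of_Means = mean(Mean_BMI),
    SD_of_Means   = sd(Mean_BMI)
  )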

Your complete case analysis would just give you, e.g., 3.3 as the mean of a variable. But that same mean might vary quite a lot across your m = 50 imputed datasets, e.g. from 1.1 up to 50.3. That is valuable information: it tells you to treat the 3.3 from the complete case analysis with a lot of care, and that there is a lot of uncertainty in general in this kind of statistic for this dataset.