0
votes

Using the mice package, I imputed a dataset like this:

imp <- mice(nhanes)

It generates 5 imputed datasets, so each variable gets 5 sets of imputed values:

imp$imp$bmi
#      1    2    3    4    5
#1  35.3 30.1 26.3 28.7 27.2
#3  30.1 22.0 30.1 28.7 22.0
#4  21.7 27.2 25.5 24.9 21.7
#6  24.9 25.5 24.9 27.5 22.5
#10 20.4 33.2 26.3 27.2 27.4
#11 22.0 27.2 27.2 30.1 22.0
#12 27.4 20.4 27.2 27.2 20.4
#16 30.1 30.1 27.2 22.5 29.6
#21 27.4 27.2 26.3 22.0 30.1

I do not understand how to choose the best imputed dataset.

For example, for bmi (above), which of the 5 columns is the best choice?

Thank you


2 Answers

1
votes

There isn't a best dataset. Selecting a single dataset would only account for the within-dataset variation (the usual sampling error) and would ignore the variation between imputed datasets, which reflects the uncertainty about the missing values.

This is why analyses such as regression should use the with() and pool() functions when working with imputed data.
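To make the within/between distinction concrete, here is a minimal sketch of the standard pooling formulas (Rubin's rules) for a single coefficient, written in base R. The estimates and standard errors are made-up placeholder numbers, not results from nhanes, and the variable names are only illustrative; in practice pool() does this for you.

```r
# Rubin's rules for one coefficient, written out by hand (base R only).
# NOTE: these numbers are hypothetical placeholders, not nhanes results.
estimates  <- c(1.10, 0.95, 1.20, 1.05, 0.90)  # estimate from each of m = 5 datasets
std_errors <- c(0.30, 0.28, 0.33, 0.31, 0.29)  # its standard error in each dataset

m <- length(estimates)

pooled_estimate <- mean(estimates)   # pooled point estimate
within_var      <- mean(std_errors^2) # average sampling variance within datasets
between_var     <- var(estimates)     # variance of the estimates across datasets

# Total variance combines both sources; (1 + 1/m) corrects for finite m
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

print(c(estimate = pooled_estimate, se = pooled_se))
```

Picking one dataset would report only something like sqrt(within_var) as the standard error, which is smaller than pooled_se whenever the estimates differ across datasets.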

1
votes

The whole concept of mice is that you have multiple imputed datasets.

If you only want one imputed dataset, you can use single imputation packages such as missForest, imputeR, or VIM, which are sometimes a little easier to use and understand syntax-wise.

The advantage of a multiple imputation package like mice is that you have multiple imputed datasets, which help account for the uncertainty introduced by the imputation itself.

You would not use just one of the imputed datasets; instead, you would perform your analysis on all 5 (or more) of them.

By doing this, you can see how much the results of your analysis vary across the imputations. Afterwards, you can pool your results. mice helps you through this process.

A typical mice workflow would look like this:

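library(mice)

# 1. Perform imputations (m = 2 datasets, 2 iterations, to keep the example fast)
imp <- mice(nhanes, maxit = 2, m = 2)

# 2. Fit the model on each imputed dataset (here m = 2)
fit <- with(data = imp, expr = lm(bmi ~ hyp + chl))

# 3. Pool the results (named `pooled` to avoid masking the pool() function)
pooled <- pool(fit)

# Print the pooled results
summary(pooled)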
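If you want to see how much the results differ between imputations before pooling, the following sketch may help. It assumes the default m = 5 and the same model as above; the seed value and the names per_dataset and first_completed are arbitrary choices for illustration.

```r
library(mice)

# Fixed seed so the imputations are reproducible (the value is arbitrary)
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ hyp + chl))

# The fitted models are stored in fit$analyses, one per imputed dataset.
# Collect their coefficients into a matrix with one row per dataset.
per_dataset <- t(sapply(fit$analyses, coef))
print(per_dataset)

# A single completed dataset can be extracted for inspection,
# but analysing only this one would ignore the between-dataset variation.
first_completed <- mice::complete(imp, 1)
head(first_completed)
```

The spread of each column of per_dataset gives a feel for the between-imputation variation that pool() folds into the pooled standard errors.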