
I am really baffled about why my imputation is failing in R's mice package. I am attempting a very simple operation with the following data frame:

dfn <- read.table(text =
"a b c  d
 0 1 0  1
 1 0 0  0
 0 0 0  0
NA 0 0  0
 0 0 0 NA", header = TRUE)

I then use mice in the following way to perform a simple mean imputation:

library(mice)
imp <- mice(dfn, method = "mean", m = 1, maxit = 1)
filled <- complete(imp)

However, my completed data looks like this:

filled
#     a b c  d
#1 0.00 1 0  1
#2 1.00 0 0  0
#3 0.00 0 0  0
#4 0.25 0 0  0
#5 0.00 0 0 NA

Why am I still getting this trailing NA? This is the simplest failing example I could construct, but my real data set is much larger and I am just trying to get a sense of where things are going wrong. Any help would be greatly appreciated!

Okay, so it seems that the issue is being caused by one column being a perfect linear combination of some of the others. Any idea about how to handle this in real data? - mjnichol
This question appears to be off-topic because it has been cross-posted on stats.stackexchange.com: stats.stackexchange.com/q/127104/11849 - Roland
@Roland Yes, I posted it there as well and a user gave the reason for the issue in the comments. - mjnichol

1 Answer


I'm not really sure how accurate this is, but here is an attempt. Even though method="mean" is supposed to impute the unconditional mean, it appears from the documentation that the predictorMatrix is not adjusted accordingly.

Normally, leftover NAs occur because the predictors suffer from multicollinearity or because there are too few cases per variable (such that the imputation model cannot be estimated). However, method="mean" shouldn't behave that way.
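
To see whether the multicollinearity explanation plausibly applies to the example, here is a rough base-R check. It is only a sketch: it does not reproduce whatever checks mice runs internally, and the names is_constant, varying and cors are just illustrative. It flags columns with no variation and computes pairwise correlations among the remaining ones:

```r
dfn <- read.table(text = "a b c  d
 0 1 0  1
 1 0 0  0
 0 0 0  0
NA 0 0  0
 0 0 0 NA", header = TRUE)

# Columns whose observed (non-NA) values never vary
is_constant <- sapply(dfn, function(x) length(unique(x[!is.na(x)])) <= 1)

# Pairwise correlations among the remaining columns,
# using only the rows where both columns are observed
varying <- dfn[, !is_constant, drop = FALSE]
cors <- cor(varying, use = "pairwise.complete.obs")

print(is_constant)
print(round(cors, 2))
```

In this toy data, c is constant and b and d agree on every row where both are observed, which fits the collinearity explanation given in the comments.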

Here is what I did:

dfn <- read.table(text="a b c  d
 0 1 0  1
 1 0 0  0
 0 0 0  0
NA 0 0  0
 0 0 0 NA", header=TRUE)

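library(mice)
# Replace the default predictor matrix with an identity matrix,
# so that no variable is listed as a predictor of another variable
imp <- mice(dfn, method = "mean", predictorMatrix = diag(ncol(dfn)))
complete(imp)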

# 1 0.00 1 0 1.00
# 2 1.00 0 0 0.00
# 3 0.00 0 0 0.00
# 4 0.25 0 0 0.00
# 5 0.00 0 0 0.25

You can try this using your actual data set, but you should check the results carefully. For example, do:

sapply(dfn, function(x) mean(x,na.rm=TRUE))

The means for each variable should be identical to those that have been imputed. Please let me know if this solves your problem.
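
As an extra cross-check that does not depend on mice at all, unconditional mean imputation can also be done by hand in base R. This is only a sketch; fill_mean and filled_manual are hypothetical names, not part of any package:

```r
dfn <- read.table(text = "a b c  d
 0 1 0  1
 1 0 0  0
 0 0 0  0
NA 0 0  0
 0 0 0 NA", header = TRUE)

# Replace each NA with the mean of the observed values in the same column
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

filled_manual <- as.data.frame(lapply(dfn, fill_mean))
print(filled_manual)
```

If complete(imp) agrees with filled_manual, the mice call is doing plain mean imputation as intended.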