I have run into a problem using the mice function for multiple imputation. I want to impute only part of the missing data, which, judging from the help page, should be possible and straightforward, but I can't get it to work. Here is an example:

I have some missing data on x and y:

library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA

I only want to impute the missing data in y:

where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))

I run the imputation:

plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)

I look at the imputed values:

test <- complete(plouf.imp)

Here I still have NAs in y:

> sum(is.na(test$y))
[1] 10

If I use where to request imputation of all missing values, it works:

where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)

> sum(is.na(test$y))
[1] 0

But this also imputes x, which I don't want in this specific case (for speed reasons in a statistical simulation study).

Does anyone have an idea?

1 Answer

This happens because of the following line:

plouf[sample(100,10),c("x","y")] <- NA

Consider your first case, in which you want to impute y only. Check its PredictorMatrix:

plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = where)
plouf.imp
#PredictorMatrix:
#   ID x y
#ID  0 0 0
#x   0 0 0
#y   1 1 0

It says that y's missing values will be predicted from ID and x, since their entries are 1 in row y.
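As a small illustration of how to read that matrix, the sketch below rebuilds it by hand from the printout and lists the predictors of each variable. The names `vars`, `pred` and `predictors_of` are made up for this example; on a fitted object the same matrix should be available as `plouf.imp$predictorMatrix`.

```r
# Predictor matrix copied from the printout above
# (rows = variables to impute, columns = predictors).
vars <- c("ID", "x", "y")
pred <- matrix(c(0, 0, 0,
                 0, 0, 0,
                 1, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(vars, vars))

# For each target variable, collect the columns flagged with 1.
predictors_of <- lapply(vars, function(v) colnames(pred)[pred[v, ] == 1])
names(predictors_of) <- vars

print(predictors_of$y)  # the variables used to impute y
```

Rows made up entirely of zeros (here ID and x) correspond to variables for which no imputation model is used.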

Now look at your sample data, where you set x and y to NA in the same rows. Wherever y is NA, x is NA as well.

So when mice uses the PredictorMatrix to impute y, it encounters NA in x and skips those rows, because all predictors (i.e. ID and x) must be non-missing in order to predict the missing values in y.
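The overlap can be checked with base R before calling mice at all. This is only a sketch: the seed is arbitrary and added for reproducibility, and `miss_tab`, `n_imputable` and `n_blocked` are names invented for the example.

```r
set.seed(1)  # arbitrary seed, not part of the original question
plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                    x = sample(10, 100, replace = TRUE),
                    y = sample(10, 100, replace = TRUE))
plouf[sample(100, 10), c("x", "y")] <- NA  # same rows set to NA in x and y

# Cross-tabulate the missingness of x against the missingness of y.
miss_tab <- table(x_missing = is.na(plouf$x), y_missing = is.na(plouf$y))
print(miss_tab)

# Rows where y is missing and the predictor x is observed ...
n_imputable <- sum(is.na(plouf$y) & !is.na(plouf$x))
# ... versus rows where y is missing and x is missing too.
n_blocked <- sum(is.na(plouf$y) & is.na(plouf$x))
```

With the data built as in the question, every row that is missing y is also missing x, so `n_imputable` is 0 and `n_blocked` is 10. The mice package also provides `md.pattern()` for a similar summary of missing-data patterns.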

Try this:

library(mice)

#sample data
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10), "x"] <- NA
set.seed(999)
plouf[sample(100,10), "y"] <- NA

#missing value imputation
whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
test <- complete(plouf.imp)
sum(is.na(test$y))
#[1] 1

Here only one value of y is left unimputed: row 39, where both x and y are NA (similar to your first case).
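If x and y really are missing in the same rows, one possible workaround (not tested as part of this answer, so treat it as an assumption) is to drop x from the predictors of y via the `predictorMatrix` argument, so that y is imputed from ID alone. The sketch below converts ID to a factor and builds the matrix by hand; `pred` and `imp` are names chosen for the example, and the mice call is guarded so the script still runs when the package is not installed.

```r
set.seed(123)
plouf <- data.frame(ID = factor(rep(LETTERS[1:10], each = 10)),
                    x = sample(10, 100, replace = TRUE),
                    y = sample(10, 100, replace = TRUE))
plouf[sample(100, 10), c("x", "y")] <- NA  # x and y missing in the same rows

# Impute only the missing cells of y.
whr <- data.frame(ID = rep(FALSE, 100), x = rep(FALSE, 100), y = is.na(plouf$y))

# Hand-built predictor matrix: rows are targets, columns are predictors.
vars <- names(plouf)
pred <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
diag(pred) <- 0
pred["y", "x"] <- 0  # assumption: y is predicted from ID only, never from x

if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(plouf, m = 1, method = "pmm", maxit = 5,
                    where = whr, predictorMatrix = pred, printFlag = FALSE)
  test <- mice::complete(imp)
  print(sum(is.na(test$y)))  # remaining NAs in y
  print(sum(is.na(test$x)))  # x is deliberately left untouched
}
```

The trade-off is that x no longer informs the imputation of y, which changes the imputation model, so whether this is acceptable depends on the simulation design.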