
I have a factor with missing values, and I know that its value depends on a combination of a few dates.

I'm having trouble imputing it, though. Both classes seem tricky for imputation functions, especially Date.

For a simple example, let's have one Date and one factor:

require(VIM)
toimpute          <- data.frame(mydates = seq(as.Date("1990-01-01"),as.Date("2000-01-01"),50),
                        imputeme = c(NA,NA,rep(c("a","b","c"),24)))
toimpute$imputeme <- as.factor(toimpute$imputeme)

It seems kNN won't go for it:

imputed <- kNN(toimpute,variable =  "imputeme")

Error in [.data.frame(data.x, , i) : undefined columns selected

mice also doesn't like it. I thought mice was at least supposed to work with factors, though this message says it must be numeric (perhaps it allows factor dependent variables but only numeric for independent variables?):

imputed <- mice(toimpute)
 iter imp variable
  1   1  imputeme
Error in FUN(newX[, i], ...) : 'x' must be numeric
In addition: Warning messages:
1: In var(data[, j], na.rm = TRUE) :
  Calling var(x) on a factor x is deprecated and will become an error.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
2: In FUN(newX[, i], ...) : NAs introduced by coercion

I guess if nothing else I can do a random forest model to predict the class of the observations with missing data, but if there's a way to do it with one of the more common missing value functions I'd like to know.

I think aregImpute works on factor variables. Check this link. - Joseph Wood
@JosephWood That seems to work; add it as an answer if you like. - Hack-R
I'm not sure about dates. From the documentation, I'm guessing they are automatically converted to factors. Also, you can look into transcan from Hmisc. - Joseph Wood

1 Answer


To handle imputation for factor variables, you can use aregImpute or transcan from the Hmisc package.

toimpute          <- data.frame(mydates = seq(as.Date("1990-01-01"),as.Date("2000-01-01"),50),
                        imputeme = c(NA,NA,rep(c("a","b","c"),24)))
toimpute$imputeme <- as.factor(toimpute$imputeme)
require(Hmisc)
imputed <- aregImpute(data=toimpute,mydates~imputeme)
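# note: is.na() on the returned object tests its list components, not the imputed values themselves
table(is.na(imputed))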

FALSE 
   19 

The aregImpute documentation, under Arguments, reads:

formula
an S model formula. You can specify restrictions for transformations of variables. The function automatically determines which variables are categorical (i.e., factor, category, or character vectors). Binary variables are automatically restricted to be linear. Force linear transformations of continuous variables by enclosing variables by the identify function (I()). It is recommended that factor() or as.factor() do not appear in the formula but instead variables be converted to factors as needed and stored in the data frame. That way imputations for factor variables (done using impute.transcan for example) will be correct. Currently reformM does not handle variables that are enclosed in functions such as I().
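
If the Date column itself is what trips up a given imputation function, one workaround is to hand it a plain numeric version of the dates instead. The sketch below is not how kNN, mice, or aregImpute work internally; it is a minimal base R illustration, assuming a simple majority vote among the k observed rows nearest in time is good enough. The column mydates_num and the helper impute_by_nearest_dates are hypothetical names made up for this example.

```r
# Same toy data as in the question
toimpute <- data.frame(
  mydates  = seq(as.Date("1990-01-01"), as.Date("2000-01-01"), 50),
  imputeme = factor(c(NA, NA, rep(c("a", "b", "c"), 24)))
)

# A Date is stored as days since 1970-01-01, so as.numeric() gives an
# ordinary numeric column (hypothetical name: mydates_num)
toimpute$mydates_num <- as.numeric(toimpute$mydates)

# Hypothetical helper: majority vote among the k observed rows whose
# dates are closest to x (ties go to the first level in table order)
impute_by_nearest_dates <- function(x, obs_dates, obs_labels, k = 5) {
  nearest <- order(abs(obs_dates - x))[seq_len(k)]
  votes   <- table(obs_labels[nearest])
  names(votes)[which.max(votes)]
}

miss <- is.na(toimpute$imputeme)

# Fill each missing factor value from its nearest observed neighbours
toimpute$imputeme[miss] <- vapply(
  toimpute$mydates_num[miss],
  impute_by_nearest_dates,
  character(1),
  obs_dates  = toimpute$mydates_num[!miss],
  obs_labels = toimpute$imputeme[!miss]
)
```

The same numeric column could also be passed to a package function in place of the Date column; whether that resolves the errors shown in the question is an assumption that would need checking against the package version in use.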