R: Reproduce original dummy matrix when there are additional factor levels

Question

I have a dummy matrix like:

df1 = data.frame(a=c(1,1,2,2,3,3))
df1$a = as.factor(df1$a)

library(caret)
d <- dummyVars(~ ., data = df1, levelsOnly = TRUE, na.action=na.omit)
predict(d, df1)

  1 2 3
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 0 1
6 0 0 1

Now I need to map new data, which may have additional or missing factor levels, to the original dummy matrix (i.e. the columns need to be the same).

When I call predict() on new data with an additional level:

df2 = data.frame(a=c(1,1,3,3,4,4))
df2$a = as.factor(df2$a)
predict(d, df2)

I get an error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) : factor a has new levels 4

How can I populate the original dummy matrix with new data?

The desired outcome would be:

Try: df2$a <- factor(df2$a, levels(df1$a)) instead of df2$a = as.factor(df2$a) — GKi
Thanks, but this would replace the additional level with NA and would therefore yield a dummy matrix with only 4 rows instead of 6 in the minimal example. — Peter
You simply have to overwrite the NA with 0 like: tt <- predict(d, df2); tt[is.na(tt)] <- 0 and you have your dummy matrix in tt. — GKi
You are right... I first thought it would not preserve the matrix order, but I was wrong (and my "desired result" matrix above is wrong, by the way). So your solution should do the trick. Thanks! — Peter

GKi · Accepted Answer · 2020-09-02T15:54:26

You can use the levels of df1$a when you create the factor for df2$a and overwrite the resulting NA with 0:

df2 <-  data.frame(a=c(1,1,3,3,4,4))
df2$a <- factor(df2$a, levels(df1$a))
tt <- predict(d, df2)
tt[is.na(tt)] <- 0
tt
#  1 2 3
#1 1 0 0
#2 1 0 0
#3 0 0 1
#4 0 0 1
#5 0 0 0
#6 0 0 0
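
For comparison, the same result can be sketched in base R without caret. This is not the method used above, only an alternative under stated assumptions: the training levels are known up front, and the names train_levels, new_a and dummy are made up for illustration. Each new value is compared against each training level, so an unseen level such as 4 matches nothing and yields an all-zero row directly, with no NA step.

```r
# Levels seen in the original data (assumed known; taken from df1 in the question)
train_levels <- c("1", "2", "3")

# New data: level 2 is missing, level 4 is new
new_a <- c(1, 1, 3, 3, 4, 4)

# Compare every value with every training level.
# outer() returns a logical matrix; multiplying by 1 makes it numeric 0/1.
dummy <- outer(as.character(new_a), train_levels, "==") * 1
colnames(dummy) <- train_levels
rownames(dummy) <- seq_along(new_a)
dummy
#   1 2 3
# 1 1 0 0
# 2 1 0 0
# 3 0 0 1
# 4 0 0 1
# 5 0 0 0
# 6 0 0 0
```

The column set is fixed by train_levels, so the matrix always has the same columns as the original, whatever levels appear in the new data.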
