I have a question regarding the use of the predict()
function with the RcppArmadillo
and RcppEigen
packages, when a factor variable has only one level. I built a MWE below using the iris
dataset.
I want to first estimate a linear regression model using RcppArmadillo
, and then use it to predict values. The data I use for estimation contains factor variables (with more than one level and no NA
). The prediction I want to make is slightly unusual in one respect: I want to predict values using the same factor level for all observations (this level being in levels used in the estimation). In the example below, it means that I want to predict Sepal.Length
as if all observations were from the "versicolor" Species.
This works well when I estimate the model using the lm()
function, but does not work with the RcppArmadillo::fastLm()
or RcppEigen::fastLm()
functions. I get the following error: Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
. The same error happens again if one of the factor levels is missing. I understand well why at least two levels are necessary for estimation, but I don't understand why having only one level is a problem for prediction once the model has been properly estimated.
The obvious solution would be to use lm()
instead of fastLm()
, but this is unfortunately impossible because my data is quite big. After some trial-and-error, I found this dirty work-around:
- I stack two versions of the data: the first one is the original data (with all factor levels), the second one is the modified data (with the same factor level for all observations);
- I predict values on this dataset (it works because all factor levels are present in this dataset);
- I keep only the modified subset of data.
Does anyone have a better solution than this, or at least an explanation as to why this error exists?
library(data.table)
# Loading iris data
iris <- as.data.table(iris)
# Estimating the model
model <-
RcppArmadillo::fastLm(Sepal.Length ~
factor(Species)
+ Sepal.Width
+ Petal.Length
+ Petal.Width,
data=iris)
summary(model)
####
#### Here is the error I don't understand
####
# This is the standard use of the predict function
iris2 <- copy(iris)
iris2[, predict := predict(model, iris2)]
# This is the way I want to use the predict function
# This does not work for some reason
iris2 <- copy(iris)
iris2[, Species := "versicolor"]
iris2[, predict2 := predict(model, iris2)]
####
#### This is a dirty work-around
####
# Creating a modified dataframe
iris3 <- copy(iris)
iris3[, `:=`(Species = "versicolor",
data = "Modified data")]
# copying the original dataframe
iris4 <- copy(iris)
iris4[, data := "Original data"]
# Stacking the original data and the modified data
iris5 <- rbind(iris3, iris4)
iris5[, predict := predict(model, iris5)]
# Keeping only the modified data
iris_final <- iris5[data == "Modified data"]