26
votes

I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type mismatch error that I can't quite figure out.

I train the model by using

qrf <- quantregForest(x = xtrain, y = ytrain)

which works without a problem, but when I try to test with new data like

quant.newdata <- predict(qrf, newdata= xtest)

it gives the following error:

Error in predict.quantregForest(qrf, newdata = xtest) : 
Type of predictors in new data do not match types of the training data.

My training and testing data come from separate files (hence separate data frames) but have the same format. I have checked the classes of the predictors with

sapply(xtrain, class)
sapply(xtest, class)

Here is the output:

> sapply(xtrain, class)
pred1     pred2     pred3     pred4     pred5     pred6     pred7     pred8 
"factor" "integer" "integer" "integer"  "factor"  "factor" "integer"  "factor" 
pred9    pred10    pred11    pred12 
"factor"  "factor"  "factor"  "factor" 


> sapply(xtest, class)
pred1     pred2     pred3     pred4     pred5     pred6     pred7     pred8 
"factor" "integer" "integer" "integer"  "factor"  "factor" "integer"  "factor" 
pred9    pred10    pred11    pred12 
"factor"  "factor"  "factor"  "factor" 

They are exactly the same. I also checked for NA values; neither xtrain nor xtest has any. Am I missing something trivial here?
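
For reference, the NA check amounts to something like the following:

# Quick sanity check: both should print 0
sum(is.na(xtrain))
sum(is.na(xtest))

# Per-column counts, in case a single predictor hides the NAs
colSums(is.na(xtrain))
colSums(is.na(xtest))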

Update I: running the prediction on the training data itself also fails:

> quant.newdata <- predict(qrf, newdata = xtrain)
Error in predict.quantregForest(qrf, newdata = xtrain) : 
names of predictor variables do not match

Update II: I combined my training and test sets so that rows 1 to 101 form the training data and the rest the test data. I modified the example provided with quantregForest as follows:

data <-  read.table("toy.txt", header = T)
n <- nrow(data)
indextrain <- 1:101
xtrain <- data[indextrain, 3:14]
xtest <- data[-indextrain, 3:14]
ytrain <- data[indextrain, 15]
ytest <- data[-indextrain, 15]

qrf <- quantregForest(x=xtrain, y=ytrain)
quant.newdata <- predict(qrf, newdata= xtest)

And it works! I'd appreciate it if anyone could explain why it works this way and not the other way.

Having two pred1 values that have different types doesn't seem like a great idea. Maybe change the factor one to be called `pred1.factor`? – Andy Clifton
Thanks for pointing it out. I changed it and reran the sapply's. Still getting the same error, both with newdata = xtrain and newdata = xtest. – Gizem
What happens if you start from a small number of predictors and add them one at a time? – Andy Clifton
Do you know if your factors in both sets contain the same levels? I.e., if you have T/F in your training data, does the corresponding column in your testing data also have both T and F? – Karan
@Karan the levels of the factors are different for at least one of the predictors. Why would it be a problem for separate training and test data but not for a single data set partitioned into training and test? – Gizem

8 Answers

37
votes

I had the same problem. You can use a small trick to equalize the factor levels of the training and test sets: bind the first row of the training set to the test set and then delete it. This works because rbind merges the factor levels of the two data frames, so after the dummy row is removed the test set's factor columns still carry all of the training levels. For your example it would look like this:

    xtest <- rbind(xtrain[1, ] , xtest)
    xtest <- xtest[-1,]
16
votes

@mgoldwasser is right in general, but there is also a very nasty bug in predict.randomForest: even if you have exactly the same levels in the training set and in the prediction set, it is possible to get this error. This happens when you have a factor with an embedded NA as a separate level. The problem is that predict.randomForest essentially does the following:

# Assume your original factor has two "proper" levels + NA level:
f <- factor(c(0,1,NA), exclude=NULL)

length(levels(f)) # => 3
levels(f)         # => "0" "1" NA

# Note that
sum(is.na(f))     # => 0
# i.e., the values of the factor are not `NA` only the corresponding level is.

# Internally predict.randomForest passes the factor (the one of the training set)
# through the function `factor(.)`.
# Unfortunately, it does _not_ do this for the prediction set.
# See what happens to f if we do that:
pf <- factor(f)

length(levels(pf)) # => 2
levels(pf)         # => "0" "1"

# In other words:
length(levels(f)) != length(levels(factor(f))) 
# => sad but TRUE

So, it will always discard the NA level from the training set and will always see one additional level in the prediction set.

A workaround is to replace the NA level with an explicit string before calling randomForest:

levels(f)[is.na(levels(f))] <- "NA"
levels(f) # => "0"  "1"  "NA"
          #              .... note that this is no longer a plain `NA`

Now calling factor(f) won't discard the level, and the check succeeds.
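
In practice you would want to apply this to every factor column of both data frames. A minimal sketch, assuming the xtrain/xtest frames from the question (the helper name fix_na_levels is just for illustration):

# Sketch: give any embedded NA level an explicit "NA" label,
# for every factor column of a data frame
fix_na_levels <- function(df) {
  for (nm in names(df)) {
    if (is.factor(df[[nm]])) {
      levels(df[[nm]])[is.na(levels(df[[nm]]))] <- "NA"
    }
  }
  df
}

xtrain <- fix_na_levels(xtrain)
xtest  <- fix_na_levels(xtest)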

16
votes

This happens because the factor variables in your training and test sets have different levels (to be more precise, the test set is missing some of the levels present in the training set). You can solve this, for example, by using the code below for each of your factor variables:

levels(test$SectionName) <- levels(train$SectionName)
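
This also explains the Update II observation in the question: when a single data frame is split into training and test rows, the factor columns in both halves keep the full set of levels, so nothing can go missing. A small sketch to illustrate:

# Subsetting a factor keeps all of its levels ...
f <- factor(c("a", "a", "b", "c"))
levels(f[1:2])              # "a" "b" "c" -- unused levels are retained

# ... unless you explicitly drop the unused ones
levels(droplevels(f[1:2]))  # "a" "b"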
15
votes

Expanding on @user1849895's solution:

common <- intersect(names(train), names(test)) 
for (p in common) { 
  if (is.factor(train[[p]])) { 
    levels(test[[p]]) <- levels(train[[p]]) 
  } 
}
2
votes

This is a problem with the levels of the individual factors: you need to make sure that the factor levels stay consistent between your training and test sets.

This is a weird quirk of random forest, and it doesn't make sense to me.
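
A quick check along these lines (a sketch using the xtrain/xtest names from the question) will list, per factor column, any levels that appear in one set but not the other:

# Report factor levels present in one set but missing from the other
for (nm in intersect(names(xtrain), names(xtest))) {
  if (is.factor(xtrain[[nm]]) || is.factor(xtest[[nm]])) {
    only_train <- setdiff(levels(xtrain[[nm]]), levels(xtest[[nm]]))
    only_test  <- setdiff(levels(xtest[[nm]]), levels(xtrain[[nm]]))
    if (length(only_train) > 0 || length(only_test) > 0) {
      cat(nm,
          "- only in train:", paste(only_train, collapse = ", "),
          "| only in test:",  paste(only_test,  collapse = ", "), "\n")
    }
  }
}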

0
votes

I just solved it by doing the following:

## Creating sample data
values_development=factor(c("a", "b", "c")) ## Values used when building the random forest model
values_production=factor(c("a", "b", "c", "ooops")) ## New values to use when using the model

## Deleting cases which were not present when developing
values_production=sapply(as.character(values_production), function(x) if(x %in% values_development) x else NA)

## Creating the factor variable, (with the correct NA value level)
values_production=factor(values_production)

## Checking
values_production # =>  a     b     c  <NA> 
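
Applied to an actual prediction data frame before calling predict, the same idea might look like this (pred1 stands in for whichever factor column carries the unseen values):

## Sketch: drop categories the model never saw, then rebuild the factor
xtest$pred1 <- factor(
  sapply(as.character(xtest$pred1),
         function(x) if (x %in% levels(xtrain$pred1)) x else NA)
)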
0
votes

I tried the following approach and it works: get the factor levels directly from the rf model itself.

levels(PredictData$columnName) <- rfmodels$forest$xlevels$columnName
0
votes
levels(PredictData$columnName) <- rfmodels$forest$xlevels$columnName

However, this will change the original data in PredictData, so the following code also has to be there:

x<-PredictData
levels(PredictData$columnName) <- rfmodels$forest$xlevels$columnName

for (i in 1:length(x$columnName))
{
  PredictData$columnName[i] <- x$columnName[i]
}

The above piece of code will solve this error.