7
votes

I am attempting to fit a random forest on some data where the class variable is binary (either 1 or 0). Here is the code I'm running:

forest.model <- randomForest(x = ticdata2000[, 1:85], y = ticdata2000[, 86],
                             ntree = 500,
                             mtry = 9,
                             importance = TRUE,
                             norm.votes = TRUE,
                             na.action = na.roughfix,
                             replace = FALSE)

But when the forest finishes, I get the following warning:

Warning message:
In randomForest.default(x = ticdata2000[, 1:85], y = ticdata2000[,  :
  The response has five or fewer unique values.  Are you sure you want to do regression?

The answer, of course, is no. I don't want to do regression. I have a single, discrete variable that takes only two values. As a result, when I run predictions with this model, I get continuous numbers, when I want a list of zeroes and ones. Can someone tell me what I'm doing wrong that makes this use regression and not classification?


1 Answer

12
votes

Change your response column to a factor using as.factor (or just factor). Since you've stored that variable as numeric 0's and 1's, R rightly interprets it as a numeric variable. If you want R to treat it differently, you have to tell it so.

This is mentioned in the documentation under the y argument:

A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode.
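
A minimal sketch of the fix, using a small made-up data frame (`toy`, with columns `x1`, `x2`, and `y`) in place of `ticdata2000`, which is not reproduced here. For the data in the question, the equivalent change would be passing `y = as.factor(ticdata2000[, 86])`. The model fit is guarded with `requireNamespace` so that the snippet still runs if the randomForest package is not installed.

```r
# Toy data standing in for ticdata2000 (hypothetical, for illustration only)
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- as.integer(toy$x1 + rnorm(100, sd = 0.5) > 0)  # numeric 0/1 response

# A numeric response triggers regression; a factor response triggers classification
y_factor <- as.factor(toy$y)
print(is.factor(y_factor))
print(levels(y_factor))

# Fit only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(x = toy[, c("x1", "x2")],
                                    y = y_factor,
                                    ntree = 50)
  print(fit$type)  # reports which mode the forest was fit in

  # Predictions are now class labels (a factor), not continuous numbers
  preds <- predict(fit, toy[, c("x1", "x2")])
  print(head(preds))
}
```

If numeric zeroes and ones are needed afterwards, `as.integer(as.character(preds))` converts the predicted labels back; calling `as.integer` directly on a factor returns the internal level codes (1 and 2) instead.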