High OOB error rate for random forest

Question

I am trying to develop a model to predict the WaitingTime variable. I am running a random forest on the following dataset.

$ BookingId          : Factor w/ 589855 levels "00002100-1E20-E411-BEB6-0050568C445E",..: 223781 471484 372126 141550 246376 512394 566217 38486 560536 485266 ...
$ PickupLocality        : int  1 67 77 -1 33 69 67 67 67 67 ...
$ ExZone                : int  0 0 0 0 1 1 0 0 0 0 ...
$ BookingSource         : int  2 2 2 2 2 2 7 7 7 7 ...
$ StarCustomer          : int  1 1 1 1 1 1 1 1 1 1 ...
$ PickupZone            : int  24 0 0 0 6 11 0 0 0 0 ...
$ ScheduledStart_Day    : int  14 20 22 24 24 24 31 31 31 31 ...
$ ScheduledStart_Month  : int  6 6 6 6 6 6 7 7 7 7 ...
$ ScheduledStart_Hour   : int  14 17 7 2 8 8 1 2 2 2 ...
$ ScheduledStart_Minute : int  6 0 58 55 53 54 54 0 12 19 ...
$ ScheduledStart_WeekDay: int  1 7 2 4 4 4 6 6 6 6 ...
$ Season                : int  1 1 1 1 1 1 1 1 1 1 ...
$ Pax                   : int  1 3 2 4 2 2 2 4 1 4 ...
$ WaitingTime           : int  45 10 25 5 15 25 40 15 40 30 ...

I split the dataset into 80%/20% training/test subsets using sample() and then run a random forest, excluding the BookingId factor, which is only used to validate the predictions.

set.seed(1)
index <- sample(1:nrow(data),round(0.8*nrow(data)))

train <- data[index,]
test <- data[-index,]

library(randomForest)

extractFeatures <- function(data) {
  features <- c(    "PickupLocality",
        "BookingSource",
        "StarCustomer",
        "ScheduledStart_Month",
        "ScheduledStart_Day",
        "ScheduledStart_WeekDay",
        "ScheduledStart_Hour",
        "Season",
        "Pax")
  fea <- data[,features]
  return(fea)
}

rf <- randomForest(extractFeatures(train), as.factor(train$WaitingTime), ntree=600, mtry=2, importance=TRUE)

The problem is that all my attempts to decrease the OOB error rate and increase the accuracy have failed. The maximum accuracy that I managed to achieve was ~23%.

I tried changing the number of features used, different ntree and mtry values, and different training/test ratios, and I also tried using only the data with WaitingTime <= 40. My last attempt was to follow MrFlick's suggestion and get the same sample size for all classes of the variable I am predicting (WaitingTime).

tempdata <- subset(tempdata, WaitingTime <= 40)
rndid <- with(tempdata, ave(tempdata$Season, tempdata$WaitingTime, FUN=function(x) {sample.int(length(x))}))

data <- tempdata[rndid<=27780,]

Do you know of any other way I can achieve an accuracy above 50%?

Records by WaitingTime class:

Thanks in advance!

thc · Accepted Answer · 2017-04-26T20:11:55

Messing with the randomForest hyperparameters will almost assuredly not significantly increase your performance.

I would suggest using a regression approach for your data. Since waiting time isn't categorical, a classification approach may not work very well. Your classification model loses the ordering information that 5 < 10 < 15, etc.
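To illustrate the ordering point, here is a minimal sketch with made-up numbers (not taken from the question's data). Classification accuracy scores a near miss and a far miss identically, while a regression-style error such as mean absolute error tells them apart. The helper names `accuracy` and `mae` are just for this example.

```r
# Hypothetical true waiting times and two sets of predictions (illustrative values only).
actual    <- c(5, 5, 5, 5)
near_miss <- c(10, 10, 10, 10)  # each prediction is off by 5 minutes
far_miss  <- c(40, 40, 40, 40)  # each prediction is off by 35 minutes

accuracy <- function(pred, obs) mean(pred == obs)      # share of exact matches
mae      <- function(pred, obs) mean(abs(pred - obs))  # mean absolute error

accuracy(near_miss, actual)  # 0
accuracy(far_miss, actual)   # 0: accuracy cannot tell the two apart
mae(near_miss, actual)       # 5
mae(far_miss, actual)        # 35: a regression metric can
```

With randomForest, passing a numeric response (that is, dropping the as.factor() wrapper around train$WaitingTime) fits a regression forest instead of a classification forest.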

One thing to try first is a simple linear regression. Bin the predicted values from the test set and recalculate the accuracy. Better? Worse? If it's better, then go ahead and try a randomForest regression model (or, as I would prefer, gradient boosted machines).
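The linear-regression check could look roughly like the sketch below. It runs on synthetic data, because the question's data is not available: the column names mirror the question, but the values, the relationship between them, and the assumption that WaitingTime falls in 5-minute classes from 5 to 40 are all invented for illustration.

```r
set.seed(1)

# Synthetic stand-in for the question's data (invented values and relationship).
n <- 2000
toy <- data.frame(
  ScheduledStart_Hour = sample(0:23, n, replace = TRUE),
  Pax                 = sample(1:4,  n, replace = TRUE)
)
raw <- 10 + 0.8 * toy$ScheduledStart_Hour + 2 * toy$Pax + rnorm(n, sd = 5)
# Assumption: waiting times are recorded in 5-minute classes between 5 and 40.
toy$WaitingTime <- pmin(pmax(5 * round(raw / 5), 5), 40)

# Same 80%/20% split as in the question.
index <- sample(seq_len(n), round(0.8 * n))
train <- toy[index, ]
test  <- toy[-index, ]

# Fit a plain linear regression on the numeric response.
fit  <- lm(WaitingTime ~ ScheduledStart_Hour + Pax, data = train)
pred <- predict(fit, newdata = test)

# Bin the continuous predictions back to the nearest 5-minute class.
pred_binned <- pmin(pmax(5 * round(pred / 5), 5), 40)

accuracy <- mean(pred_binned == test$WaitingTime)  # comparable to classification accuracy
mae      <- mean(abs(pred - test$WaitingTime))     # average miss in minutes
```

On the real data, the formula would list the predictors from extractFeatures(), and the binned accuracy can be compared directly with the ~23% from the classification forest.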

Secondly, it's possible that your data is just not predictive of the variable that you're interested in. Maybe the data got messed up somehow upstream. It might be a good diagnostic to first calculate correlation and/or mutual information of the predictors with the outcome.
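A rough version of that diagnostic in base R is sketched below. `mutual_info` is a hypothetical helper written for this example (it is not from any package), and the toy vectors `hour`, `noise`, and `wait` are invented so that one predictor is informative and the other is not.

```r
# Mutual information (in bits) between two discrete vectors, from their joint table.
mutual_info <- function(x, y) {
  joint    <- unclass(table(x, y)) / length(x)  # joint probabilities p(x, y)
  px       <- rowSums(joint)                    # marginal p(x)
  py       <- colSums(joint)                    # marginal p(y)
  expected <- outer(px, py)                     # p(x) * p(y) under independence
  nz       <- joint > 0                         # skip empty cells to avoid log(0)
  sum(joint[nz] * log2(joint[nz] / expected[nz]))
}

set.seed(1)
n     <- 5000
hour  <- sample(0:23, n, replace = TRUE)   # informative predictor (by construction)
noise <- sample(1:4,  n, replace = TRUE)   # unrelated predictor (by construction)
wait  <- 5 * (1 + hour %/% 6) + sample(c(0, 5), n, replace = TRUE)

cor(hour, wait, method = "spearman")  # rank correlation with the outcome
mutual_info(hour, wait)               # clearly above zero
mutual_info(noise, wait)              # close to zero
```

Mutual information is the better fit for code-like predictors such as PickupLocality or BookingSource, where the integer values carry no order and a correlation coefficient would be misleading. If every predictor in the real data scores near zero, no amount of model tuning is likely to help.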

Also, with so many class labels, 23% might actually not be that bad. If you guess at random in proportion to class frequency, the probability that a particular datapoint is labeled correctly is N_class/N, where N_class is the number of records in its class and N is the total number of records. So the accuracy of a random-guess model is not 50%. You can calculate the adjusted Rand index to show that your model is better than random guessing.
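As a small worked example of those baselines, using hypothetical class counts (not the question's actual counts):

```r
# Hypothetical number of records per WaitingTime class (illustrative only).
counts <- c(`5` = 400, `10` = 300, `15` = 200, `20` = 100)
p <- counts / sum(counts)  # class frequencies N_class / N

uniform_guess   <- 1 / length(counts)  # pick any class with equal probability: 0.25
frequency_guess <- sum(p^2)            # guess class c with probability p_c:     0.30
majority_guess  <- max(p)              # always predict the most common class:   0.40
```

After balancing the classes as in the question, all three baselines collapse to 1/K for K classes, so the fair comparison for 23% is 1/K rather than 50%.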
