I am trying to develop a model to predict the WaitingTime variable. I am running a random forest on the following dataset.
$ BookingId : Factor w/ 589855 levels "00002100-1E20-E411-BEB6-0050568C445E",..: 223781 471484 372126 141550 246376 512394 566217 38486 560536 485266 ...
$ PickupLocality : int 1 67 77 -1 33 69 67 67 67 67 ...
$ ExZone : int 0 0 0 0 1 1 0 0 0 0 ...
$ BookingSource : int 2 2 2 2 2 2 7 7 7 7 ...
$ StarCustomer : int 1 1 1 1 1 1 1 1 1 1 ...
$ PickupZone : int 24 0 0 0 6 11 0 0 0 0 ...
$ ScheduledStart_Day : int 14 20 22 24 24 24 31 31 31 31 ...
$ ScheduledStart_Month : int 6 6 6 6 6 6 7 7 7 7 ...
$ ScheduledStart_Hour : int 14 17 7 2 8 8 1 2 2 2 ...
$ ScheduledStart_Minute : int 6 0 58 55 53 54 54 0 12 19 ...
$ ScheduledStart_WeekDay: int 1 7 2 4 4 4 6 6 6 6 ...
$ Season : int 1 1 1 1 1 1 1 1 1 1 ...
$ Pax : int 1 3 2 4 2 2 2 4 1 4 ...
$ WaitingTime : int 45 10 25 5 15 25 40 15 40 30 ...
I am splitting the dataset into training/test subsets into 80%/20% using the sample method and then running a random forest excluding the BookingId factor. This is only used to validate the predictions.
set.seed(1)
index <- sample(1:nrow(data),round(0.8*nrow(data)))
train <- data[index,]
test <- data[-index,]
library(randomForest)
extractFeatures <- function(data) {
features <- c( "PickupLocality",
"BookingSource",
"StarCustomer",
"ScheduledStart_Month",
"ScheduledStart_Day",
"ScheduledStart_WeekDay",
"ScheduledStart_Hour",
"Season",
"Pax")
fea <- data[,features]
return(fea)
}
rf <- randomForest(extractFeatures(train), as.factor(train$WaitingTime), ntree=600, mtry=2, importance=TRUE)
The problem is that all attempts to try and decrease OOB error rate and increase the accuracy failed. The maximum accuracy that I managed to achieve was ~23%.
I tried to change the number of features used, different ntree and mtry values, different training/test ratios, and also taking into consideration only data with WaitingTime <= 40. My last attempt was to follow MrFlick's suggestion and get the same sample size for all classes of get the same sample size for all classes of my predicting variable (WaitingTime).1
tempdata <- subset(tempdata, WaitingTime <= 40)
rndid <- with(tempdata, ave(tempdata$Season, tempdata$WaitingTime, FUN=function(x) {sample.int(length(x))}))
data <- tempdata[rndid<=27780,]
Do you know of any other ways how I can achieve at least accuracy over 50%?
Records by WaitingTime class:
Thanks in advance!