I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() function in the caret package. I'm using the random forest method and get 0.999 accuracy on my training set. However, when I use the model to predict, it assigns every test observation to the same class (i.e. all 20 test observations are classified as "1", out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting exact code), but I'm not sure what is happening.
My question: when I print the final model (fit$finalModel), it reports 500 trees in total (the default, as expected), but the number of variables tried at each split is 35. I know that with classification, the standard number of variables tried at each split is the square root of the total number of predictors (so it should be sqrt(60) ≈ 7.7, call it 8). Could this be the problem?
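To make the arithmetic concrete, here is a minimal base-R sketch of the two numbers involved. It assumes (as far as I understand it) that randomForest rounds sqrt(p) down for classification, giving 7 rather than 8 for 60 predictors, and that the formula interface expands factor columns into dummy variables, so the predictor count the model sees can be larger than the number of columns in the data frame. The `toy` data frame and the `default_mtry` helper are hypothetical stand-ins, not part of my real data or of any package.

```r
# Hypothetical stand-in data: 3 numeric predictors, one 4-level factor,
# and a 5-class outcome y. Names and values are illustrative only.
toy <- data.frame(
  y = factor(rep(1:5, length.out = 50)),
  a = seq(0, 1, length.out = 50),
  b = seq(1, 0, length.out = 50),
  c = rep(c(0.5, 1.5), length.out = 50),
  f = factor(rep(c("p", "q", "r", "s"), length.out = 50))
)

# Number of predictor columns in the data frame (outcome excluded)
p_raw <- ncol(toy) - 1

# Number of predictor columns after the formula expands the factor into
# dummy variables (subtract 1 for the intercept column)
p_expanded <- ncol(model.matrix(y ~ ., data = toy)) - 1

# Assumed classification default when mtry is not supplied: floor(sqrt(p))
default_mtry <- function(p) max(floor(sqrt(p)), 1)

p_raw
p_expanded
default_mtry(60)
```

If the real training set contains factor columns, the same `model.matrix` count would show how many predictors the forest is actually choosing from.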
I'm confused about whether something is wrong with my model, my data cleaning, or something else.
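library(caret)
set.seed(10000)
# 5-fold cross-validation for resampling
fitControl <- trainControl(method = "cv", number = 5)
# Random forest on all predictors; mtry is left to train()'s default tuning
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)
# Print the final random forest chosen by train()
fit$finalModel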
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 41
OOB estimate of error rate: 0.01%
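For comparison, a minimal sketch of pinning mtry instead of letting train() choose it. My understanding is that train() with method = "rf" evaluates a small grid of mtry values by default and keeps the one with the best resampled accuracy, which would explain a final value far from sqrt(p). The sketch uses the built-in iris data as a stand-in for my real training set (an assumption for illustration only) and passes a one-row tuneGrid so that only floor(sqrt(p)) is tried. It needs the caret and randomForest packages installed.

```r
library(caret)  # method = "rf" also requires the randomForest package

set.seed(10000)
fitControl <- trainControl(method = "cv", number = 5)

# Stand-in data: iris replaces the real `training` set here.
p <- ncol(iris) - 1                         # 4 predictors, no factors
grid <- data.frame(mtry = floor(sqrt(p)))   # single candidate value

# tuneGrid restricts train() to the mtry values listed in `grid`
fit_fixed <- train(Species ~ ., data = iris, method = "rf",
                   trControl = fitControl, tuneGrid = grid)

# mtry actually used by the final forest
fit_fixed$finalModel$mtry

# Distribution of predicted classes; a single dominant class on new data
# would reproduce the symptom described above
table(predict(fit_fixed, newdata = iris))
```

Running `table()` on the predictions for the real test set is a quick way to confirm whether every observation really lands in one class.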