R - caret::train “random forest” parameters

Question

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() function in the caret package. I'm using the random forest method and get 0.999 accuracy on my training set; however, when I use the model to predict, it assigns every test observation to the same class (i.e. all 20 observations are classified as "1" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting exact code), but I'm not sure what is happening.

My question: when I print the final model of the fit (fit$finalModel), it says it built 500 trees in total (the default, as expected), but the number of variables tried at each split is 35. I know that for classification, the standard number of variables tried at each split is the square root of the total number of predictors (so it should be sqrt(60) ≈ 7.7, call it 8). Could this be the problem?

I'm unsure whether there is something wrong with my model, my data cleaning, or something else.

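library(caret)  # provides train() and trainControl()
set.seed(10000)
# 5-fold cross-validation; train() tunes mtry over its default grid
fitControl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)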

fit$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
           Type of random forest: classification
                 Number of trees: 500
No. of variables tried at each split: 41

    OOB estimate of  error rate: 0.01%

I would also look at your tree depth, which is one of the easiest ways to overfit a random forest. I suggest starting with a hyperparameter grid search on tree depth and number of variables. You should also confirm the results with your own validation set. Also, are your classes balanced? Setting a higher nodesize may also help. — Ian Wesley
The package author states that random forests are resistant to overfitting. I still have difficulty understanding exactly how this is the case. I believe it has something to do with bootstrapping when training the model. — Seanosapien
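
The grid search over the number of variables mentioned in the comments could be sketched roughly as follows. randomForest's own default for classification is floor(sqrt(p)), whereas train() picks mtry by resampling over a set of candidate values, so the value printed in fit$finalModel need not match the square-root rule. The candidate values 15 and 30 below are hypothetical, and the train() call is left commented out because it assumes caret is loaded and that the question's training and fitControl objects exist.

```r
p <- 60                          # number of predictors stated in the question
default_mtry <- floor(sqrt(p))   # randomForest's default mtry for classification

# Candidate mtry values to compare; 15 and 30 are arbitrary illustrations.
tune_grid <- expand.grid(mtry = c(default_mtry, 15, 30))

# With caret loaded and the question's objects available:
# fit <- train(y ~ ., data = training, method = "rf",
#              trControl = fitControl, tuneGrid = tune_grid)
# fit$bestTune   # the mtry value selected by cross-validation
```

Passing a single-row grid such as data.frame(mtry = default_mtry) would pin mtry to the square-root default instead of tuning it.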

Len Greski · Accepted Answer · 2018-04-15T02:24:14

Using random forest for the final project of the Johns Hopkins Practical Machine Learning course on Coursera will generate the same prediction for all 20 quiz test cases if students fail to remove independent variables in which more than 50% of the values are NA.

SOLUTION: remove variables that have a high proportion of missing values from the model.
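
A minimal sketch of that filtering step, assuming a 50% cutoff as described above. The small data frame and its column names (x1, mostly_na, some_na) are made up for illustration and are not the course data; only base R is used.

```r
# Toy stand-in for the training data; column names are hypothetical.
training <- data.frame(
  y         = factor(c("A", "B", "A", "C", "B", "A")),
  x1        = c(1.2, 0.7, 3.1, 2.2, 0.9, 1.5),
  mostly_na = c(NA, NA, NA, NA, 5.0, NA),
  some_na   = c(1, NA, 3, 4, 5, 6)
)

# Proportion of missing values in each column.
na_share <- colMeans(is.na(training))

# Keep only the columns in which at most 50% of the values are missing.
keep <- names(na_share)[na_share <= 0.5]
training_clean <- training[, keep, drop = FALSE]

# The same column selection would then be applied to the test set, e.g.:
# testing_clean <- testing[, intersect(keep, names(testing)), drop = FALSE]
```

Here mostly_na is dropped while some_na is kept. Any NA values remaining in the retained columns would still need to be handled (for example by imputation or row removal) before fitting.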
