I would really appreciate your feedback on interpreting my RF model and on how to evaluate the results in general.
57658 samples
27 predictor
2 classes: 'stayed', 'left'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:
  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
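For reading the Sens and Spec columns, it may help to remember that caret computes them from hard class predictions (in effect a 0.5 probability cutoff for two classes), whereas ROC summarizes every cutoff. The sketch below uses simulated data, not the data from this post; the class balance, the score distribution, and the helper `at_cutoff` are all assumptions made for illustration. It shows how Sens and Spec move when only the cutoff changes.

```r
set.seed(1)

# Hypothetical data: about 10% of rows are 'left' (an assumption, not from the post)
n   <- 1000
obs <- sample(c("stayed", "left"), n, replace = TRUE, prob = c(0.9, 0.1))

# Hypothetical, weakly informative predicted probability of 'stayed'
p_stayed <- plogis(2.2 + ifelse(obs == "stayed", 0.4, -0.4) + rnorm(n, sd = 0.8))

# Sens/Spec with 'stayed' treated as the event of interest
at_cutoff <- function(cutoff) {
  pred <- ifelse(p_stayed >= cutoff, "stayed", "left")
  c(cutoff = cutoff,
    Sens   = mean(pred[obs == "stayed"] == "stayed"),
    Spec   = mean(pred[obs == "left"] == "left"))
}

res <- t(sapply(c(0.5, 0.8, 0.9, 0.95), at_cutoff))
print(res)
```

Raising the cutoff can only lower Sens and raise Spec, so a model with Sens near 1 and Spec near 0 at the default cutoff may still rank cases usefully, which is what ROC measures.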
After making several adjustments to the functional form of my Y variable and to the way I split my data, I got the following results. My ROC improved slightly, but interestingly my Sens and Spec changed drastically compared to my initial model.
35000 samples
27 predictor
2 classes: 'stayed', 'left'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:
  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
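The Sens and Spec columns here are close to the first table's values with the two columns swapped. That would be consistent with the factor level order of the outcome having changed, since caret's `twoClassSummary` treats the first factor level as the event of interest. The toy example below is a sketch under that assumption; the labels and the helper `sens_spec` are hypothetical and only illustrate how the same predictions give mirrored Sens and Spec depending on which class counts as positive.

```r
# Toy labels (hypothetical): a model that predicts "stayed" for every row
obs  <- c(rep("stayed", 8), rep("left", 2))
pred <- rep("stayed", 10)

# Sensitivity and specificity for a chosen positive class
sens_spec <- function(obs, pred, positive) {
  is_pos <- obs == positive
  c(Sens = mean(pred[is_pos] == positive),   # positives correctly flagged
    Spec = mean(pred[!is_pos] != positive))  # negatives correctly left alone
}

stayed_first <- sens_spec(obs, pred, positive = "stayed")
left_first   <- sens_spec(obs, pred, positive = "left")
print(rbind(stayed_first, left_first))
```

Checking `levels(train_data$left_welfare)` before and after the recoding of Y would show whether this is what happened.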
This time, I split the data randomly rather than by time, and experimented with several mtry values using the following code:
```{r Cross Validation Part 1}
library(caret)
set.seed(1992) # set a seed for reproducibility
folds <- createFolds(train_data$left_welfare, k = 5) # partition the data into 5 roughly equal folds
# Tuning grid: every combination of mtry, splitrule, and min.node.size
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20), splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))
sapply(folds, length) # number of rows in each fold
```
I got the following results:
Random Forest
84172 samples
14 predictor
2 classes: 'stayed', 'left'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:
  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000  NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000  NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000  NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000  NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000  NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000  NaN        NaN
  15    extratrees  0.5000000  NaN        NaN
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
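A few things in this output are worth checking. In ranger, `"variance"` is a regression split rule; for a classification forest the usual choices are `"gini"` and `"extratrees"`, which would be consistent with every `variance` row failing and showing ROC = 0.5 with NaN. Likewise, `mtry = 15` exceeds the 14 predictors, which would be consistent with the failed `15 extratrees` row. Finally, `createFolds()` returns the held-out rows by default, and the reported sample sizes are about one fifth of the data, which would fit those held-out rows having been used as training sets. The sketch below shows one possible setup under these assumptions. The `toy` data frame, its columns, and the grid values are hypothetical stand-ins, and it assumes the caret, ranger, and pROC packages are installed.

```r
library(caret)

set.seed(1992)

# Hypothetical stand-in for train_data: 4 numeric predictors, binary outcome
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$left_welfare <- factor(ifelse(toy$x1 + rnorm(n) > 0.8, "left", "stayed"),
                           levels = c("stayed", "left"))

# returnTrain = TRUE gives the rows to train on in each fold (about 4/5 of n)
train_idx <- createFolds(toy$left_welfare, k = 5, returnTrain = TRUE)
print(sapply(train_idx, length))

# Classification split rules only, and mtry no larger than the predictor count
grid <- expand.grid(mtry = c(2, 4),
                    splitrule = c("gini", "extratrees"),
                    min.node.size = c(1, 5, 10),
                    stringsAsFactors = FALSE)

ctrl <- trainControl(method = "cv", index = train_idx,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(left_welfare ~ ., data = toy, method = "ranger",
             metric = "ROC", trControl = ctrl, tuneGrid = grid,
             num.trees = 100)
print(fit$results[, c("mtry", "splitrule", "min.node.size", "ROC", "Sens", "Spec")])
```

With `returnTrain = TRUE`, the "Summary of sample sizes" line should report roughly four fifths of the rows per fold rather than one fifth.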