I am working on Random Forest classification.
I have found that cforest() from the "party" package usually performs better than randomForest().
However, cforest seems to overfit easily.
A toy example
Here is a random data set with a binary factor response and 10 numeric predictors, all generated from rnorm().
# Sorry for redundant preparation.
data <- data.frame(response=rnorm(100))
data$response <- factor(data$response < 0)
data <- cbind(data, matrix(rnorm(1000), ncol=10))
colnames(data)[-1] <- paste("V",1:10,sep="")
Fit cforest with the unbiased parameter set, cforest_unbiased(), which seems to be the recommended one.
cf <- cforest(response ~ ., data=data, controls=cforest_unbiased())
table(predict(cf), data$response)
# FALSE TRUE
# FALSE 45 7
# TRUE 6 42
Surprisingly good prediction performance on meaningless data.
On the other hand, randomForest behaves honestly.
rf <- randomForest(response ~., data=data)
table(predict(rf),data$response)
# FALSE TRUE
# FALSE 25 27
# TRUE 26 22
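For comparison, randomForest also looks near-perfect if I explicitly ask for in-sample predictions by passing the training data as newdata; as far as I understand, predict.randomForest returns OOB predictions only when newdata is omitted. A minimal sketch (regenerating the same kind of noise data):

```r
library(randomForest)

set.seed(1)
data <- data.frame(response = factor(rnorm(100) < 0))
data <- cbind(data, matrix(rnorm(1000), ncol = 10))
colnames(data)[-1] <- paste("V", 1:10, sep = "")

rf <- randomForest(response ~ ., data = data)

# No newdata: OOB predictions -> roughly chance level on noise
table(predict(rf), data$response)

# newdata = training set: in-sample predictions -> nearly perfect
table(predict(rf, newdata = data), data$response)
```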
Where do these differences come from?
I am afraid I am using cforest in the wrong way.
Here are some extra observations on cforest:
- Changing the number of variables did not affect the result much.
- The variable importance values (computed by varimp(cf)) were rather low, compared to those obtained with realistic explanatory variables.
- The AUC of the ROC curve was nearly 1.
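For reference, this is roughly how I obtained the AUC. I used treeresponse(), which is, as far as I understand, the party way to get per-observation class probabilities, together with the rank-based (Wilcoxon) shortcut for the AUC so that no extra package is needed:

```r
library(party)

set.seed(42)
data <- data.frame(response = factor(rnorm(100) < 0))
data <- cbind(data, matrix(rnorm(1000), ncol = 10))
colnames(data)[-1] <- paste("V", 1:10, sep = "")

cf <- cforest(response ~ ., data = data, controls = cforest_unbiased())

# In-sample probability of the second factor level ("TRUE")
probs <- sapply(treeresponse(cf), function(p) p[2])

# Rank-based (Wilcoxon) AUC, no ROCR/pROC dependency
y   <- data$response == "TRUE"
n1  <- sum(y); n0 <- sum(!y)
auc <- (sum(rank(probs)[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```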
I would appreciate your advice.
Additional note
Some wondered why the training data set was passed to predict().
I did not prepare a separate test set because, for randomForest, calling predict() on the fitted object returns predictions for the OOB samples; this does not seem to be the case for cforest.
cf. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
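(In case it is relevant: if I read the party documentation correctly, predict.cforest accepts an OOB argument, so out-of-bag predictions can be requested explicitly. A minimal sketch of what I mean, on the same kind of noise data:)

```r
library(party)

set.seed(1)
data <- data.frame(response = factor(rnorm(100) < 0))
data <- cbind(data, matrix(rnorm(1000), ncol = 10))
colnames(data)[-1] <- paste("V", 1:10, sep = "")

cf <- cforest(response ~ ., data = data, controls = cforest_unbiased())

# Default: in-sample predictions, which look far too good on noise
table(predict(cf), data$response)

# OOB = TRUE: out-of-bag predictions, comparable to randomForest's default
table(predict(cf, OOB = TRUE), data$response)
```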