I am using randomForest to classify data, and there are two things I cannot work out:
1- How can I obtain (preferably as a data frame with 3 columns) the real class in testData (the Species column in the example below), the class predicted by the random forest, and the probability score of that prediction? For example, take the data set below and one case in testData where the Species (hidden from the random forest) was versicolor, but the classifier wrongly predicted virginica with a probability score of 0.67. I want this kind of information but don't know how to obtain it.
2- How can I get a confusion matrix for testData and trainData that also gives the class.error, like the one shown when the model is printed?
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
trainData <- iris[ind==1,]
testData <- iris[ind==2,]
# grow the forest
iris.rf <- randomForest(Species ~ ., data=trainData)
print(iris.rf)
Call:
 randomForest(formula = Species ~ ., data = trainData)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         45          0         0  0.00000000
versicolor      0         39         1  0.02500000
virginica       0          3        32  0.08571429
# predict on the training data again
iris.pred <- predict(iris.rf, trainData)
table(observed = trainData$Species, predicted = iris.pred)
            predicted
observed     setosa versicolor virginica
  setosa         45          0         0
  versicolor      0         40         0
  virginica       0          0        35
# test on testData
irisPred <- predict(iris.rf, newdata = testData)
table(irisPred, testData$Species)
irisPred     setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          8         1
  virginica       0          2        14
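
For reference, here is a minimal sketch of the kind of output both questions are after. It is one possible approach, not necessarily the only or best one. It relies on predict() for randomForest objects accepting type = "prob", which returns the fraction of trees voting for each class. The names results and confusion_with_error are hypothetical and introduced only for illustration. Rows are the observed classes and columns the predicted ones, matching the layout of the printed model (the table(irisPred, testData$Species) call above has them the other way round).

```r
library(randomForest)

data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
iris.rf <- randomForest(Species ~ ., data = trainData)

# Predicted class and per-class vote fractions for the held-out rows
pred_class <- predict(iris.rf, newdata = testData, type = "response")
pred_prob  <- unclass(predict(iris.rf, newdata = testData, type = "prob"))

# For each row, pick the vote fraction in the column of the predicted class
idx <- cbind(seq_len(nrow(testData)),
             match(as.character(pred_class), colnames(pred_prob)))

# Question 1: observed class, predicted class, probability of the prediction
# (results is a hypothetical name)
results <- data.frame(
  observed    = testData$Species,
  predicted   = pred_class,
  probability = pred_prob[idx]
)
head(results)

# Only the misclassified test cases
results[results$observed != results$predicted, ]

# Question 2: confusion matrix with a class.error column
# (confusion_with_error is a hypothetical helper, not part of randomForest)
confusion_with_error <- function(observed, predicted) {
  cm <- table(observed = observed, predicted = predicted)
  # Per-class error: share of each observed class that was misclassified.
  # This is NaN for a class with no observed rows.
  class.error <- 1 - diag(cm) / rowSums(cm)
  cbind(cm, class.error)
}

test_cm  <- confusion_with_error(testData$Species, pred_class)
train_cm <- confusion_with_error(trainData$Species, predict(iris.rf, trainData))
test_cm
train_cm
```

Note that the matrix shown by print(iris.rf) is stored in iris.rf$confusion and is based on out-of-bag predictions, so it will usually show more errors than train_cm, which re-predicts rows the trees have already seen.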