
I am using randomForest to classify data, and there are two things I cannot work out:

1- How can I obtain, preferably as a data frame with three columns, the true class in testData (the Species column in the example below), the class predicted by the random forest, and the probability score of that prediction? For example, suppose that for one case in testData the true Species (hidden from the random forest) was versicolor, but the classifier wrongly predicted virginica with a probability score of 0.67. That is the kind of information I want, but I don't know how to obtain it.

2- How can I get a confusion matrix for testData and trainData that also reports class.error, like the one shown when the model is printed?

library(randomForest)

data(iris)
set.seed(111)
# Random split: roughly 80% training, 20% test
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]
# Grow the forest
iris.rf <- randomForest(Species ~ ., data = trainData)
print(iris.rf)

Call:
 randomForest(formula = Species ~ ., data = trainData) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         45          0         0  0.00000000
versicolor      0         39         1  0.02500000
virginica       0          3        32  0.08571429

# Predict on the training data again
iris.pred <- predict(iris.rf, trainData)
table(observed = trainData$Species, predicted = iris.pred)

           predicted
observed     setosa versicolor virginica
  setosa         45          0         0
  versicolor      0         40         0
  virginica       0          0        35

# Test on testData
irisPred <- predict(iris.rf, newdata = testData)
table(irisPred, testData$Species)

irisPred     setosa versicolor virginica
setosa          5          0         0
versicolor      0          8         1
virginica       0          2        14

1 Answer


I used the caret package to run random forest with trainControl:

library(caret)

# 10-fold cross-validated random forest; tuneLength = 3 tries up to three values of mtry
model <- train(Species ~ ., data = trainData,
               method = 'rf', tuneLength = 3,
               trControl = trainControl(
                 method = 'cv', number = 10,
                 classProbs = TRUE))
model$results

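# Predicted class and per-class vote fractions from the iris.rf model fitted in the question
irisPred_species <- predict(iris.rf, newdata = testData)
irisPred_prob <- predict(iris.rf, newdata = testData, type = "prob")

# One row per test case: true class, predicted class, and one probability column per class
out.table <- data.frame(actual.species = testData$Species, pred.species = irisPred_species, irisPred_prob)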
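
If you want exactly three columns (true class, predicted class, and the probability of the predicted class only), one possible sketch is below. It rebuilds the model from the question so that it runs on its own; the names pred_class, pred_prob, idx and result are my own, and the exact split can differ between R versions because the behaviour of sample() changed in R 3.6.0.

```r
library(randomForest)

# Same setup as in the question
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
iris.rf <- randomForest(Species ~ ., data = trainData)

# Predicted class (factor) and per-class vote fractions (matrix, one column per class)
pred_class <- predict(iris.rf, newdata = testData)
pred_prob  <- predict(iris.rf, newdata = testData, type = "prob")

# For each row, pick the column that matches the predicted class
idx <- cbind(seq_len(nrow(pred_prob)),
             match(as.character(pred_class), colnames(pred_prob)))

result <- data.frame(actual      = testData$Species,
                     predicted   = pred_class,
                     probability = pred_prob[idx])

# Show only the misclassified test cases
result[result$actual != result$predicted, ]
```

The probability here is the fraction of trees that voted for the predicted class, which is what type = "prob" returns for a randomForest classifier.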

You can get the out-of-bag (OOB) error rate on the training data, with one row per number of trees grown, by:

iris.rf$err.rate
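
Since err.rate has one row for every tree added, the figure shown by print(iris.rf) corresponds to the last row. A small sketch of how to read it, again refitting the model from the question so the snippet is self-contained (final_err is just an illustrative name):

```r
library(randomForest)

# Same setup as in the question
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainData <- iris[ind == 1, ]
iris.rf <- randomForest(Species ~ ., data = trainData)

# One row per tree; columns are the overall OOB error and one error per class
dim(iris.rf$err.rate)
head(iris.rf$err.rate, 3)

# Error rates after all trees have been grown
final_err <- iris.rf$err.rate[iris.rf$ntree, ]
final_err["OOB"]
```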

And the confusion matrix for the training data, which is based on the OOB predictions and includes class.error:

iris.rf$confusion
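
The confusion component only covers the training data. For testData, one way to get the same layout (observed classes in rows, a class.error column on the right) is to compute it from a table. This is a sketch under my own assumptions: confusion_with_error is a hypothetical helper, not part of randomForest, and class.error will be NaN for any class that has no observed cases.

```r
library(randomForest)

# Same setup as in the question
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
iris.rf <- randomForest(Species ~ ., data = trainData)

# Hypothetical helper: confusion matrix plus per-class error rate.
# Both arguments must be factors with the same levels.
confusion_with_error <- function(observed, predicted) {
  cm <- table(observed = observed, predicted = predicted)
  # Share of each observed class that was predicted as something else
  class.error <- 1 - diag(cm) / rowSums(cm)
  cbind(cm, class.error)
}

# Test data
cm_test <- confusion_with_error(testData$Species,
                                predict(iris.rf, newdata = testData))
cm_test

# Training data, re-predicted with the full forest
cm_train <- confusion_with_error(trainData$Species,
                                 predict(iris.rf, newdata = trainData))
cm_train
```

Re-predicting trainData with the full forest usually looks far better than the OOB matrix in iris.rf$confusion, because every tree has already seen most of those rows, so the OOB version is the more honest estimate for the training data.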