
There are very complicated examples on the internet, and I couldn't apply them to my code. I have a dataset consisting of 14 independent variables and 1 dependent variable, and I'm doing classification with R. Here is my code:

dataset <- read.table("adult.data", sep = ",", na.strings = c(" ?"))
colnames(dataset) <- c( "age", 
                        "workclass", 
                        "fnlwgt", 
                        "education", 
                        "education.num", 
                        "marital.status", 
                        "occupation", 
                        "relationship", 
                        "race", 
                        "sex", 
                        "capital.gain", 
                        "capital.loss", 
                        "hours.per.week", 
                        "native.country",
                        "is.big.50k")
dataset = na.omit(dataset)

library(caret)
set.seed(1)
training.indices <- createDataPartition(y = dataset$is.big.50k, p = 0.7, list = FALSE)
training.set <- dataset[training.indices,]
test.set <- dataset[-training.indices,]

###################################################################
## Naive Bayes
library(e1071)
classifier = naiveBayes(x = training.set[,-15],
                        y = training.set$is.big.50k)

prediction = predict(classifier, newdata = test.set[,-15])

cm <- confusionMatrix(data = prediction, reference = test.set[,15], 
                      positive = levels(test.set$is.big.50k)[2])

accuracy <- sum(diag(as.matrix(cm))) / sum(as.matrix(cm))

sensitivity <- sensitivity(prediction, test.set[,15], 
                           positive = levels(test.set$is.big.50k)[2])

specificity <- specificity(prediction, test.set[,15], 
                           negative = levels(test.set$is.big.50k)[1])

I also tried the following, and it worked. Is there any mistake? Is there a problem in the transformation step (the as.numeric() calls)?
library(ROCR)
pred <- prediction(as.numeric(prediction), as.numeric(test.set[,15]))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for NB", col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)
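For comparison, here is a sketch of the same ROCR plot built from the posterior probabilities (type = "raw") instead of as.numeric() on the hard class labels (assuming the classifier and test.set from above). Hard labels contribute only a single non-trivial point to the curve, while probabilities provide the full ranking ROCR needs:

library(ROCR)
# posterior probabilities; column 2 is P(positive class)
raw.probs <- predict(classifier, newdata = test.set[,-15], type = "raw")
pred <- prediction(raw.probs[, 2], test.set$is.big.50k)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for NB (posteriors)", col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)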

Do you know the brms R package? - patL
I'm doing classification, and brms is a regression package. @patL - FK7

2 Answers


Try this:

set.seed(1)
library(data.table)
amount = 100
dataset = data.table(
  x = runif(amount, -1, 1)
  ,y = runif(amount, -1, 1)
)
# inside the circle with radius 0.5? -> true, otherwise false
dataset[, target := sqrt(x^2 + y^2) < 0.5]  # := adds the column by reference
plot(dataset[target == F]$x, dataset[target == F]$y, col="red", xlim = c(-1, 1), ylim = c(-1, 1))
points(dataset[target == T]$x, dataset[target == T]$y, col="green")

library(caret)

training.indices <- createDataPartition(y = dataset$target, p = 0.7, list = FALSE)
# createDataPartition returns a one-column matrix; data.table wants a plain vector for i
training.set <- dataset[as.vector(training.indices),]
test.set <- dataset[-as.vector(training.indices),]

###################################################################
## Naive Bayes
library(e1071)
classifier = naiveBayes(x = training.set[,.(x,y)],
                        y = training.set$target)

prediction = predict(classifier, newdata = test.set[,.(x,y)], type="raw")
prediction = prediction[, 2]
test.set[, prediction := prediction]  # := adds the column by reference

TPrates = c()
TNrates = c()
thresholds = seq(0, 1, by = 0.1)
for (threshold in thresholds) {
  # percentage of correctly classified true examples
  TPrateForThisThreshold = test.set[target == T & prediction > threshold, .N]/test.set[target == T, .N]
  # percentage of correctly classified false examples
  TNrateForThisThreshold = test.set[target == F & prediction <= threshold, .N]/test.set[target == F, .N]

  TPrates = c(TPrates, TPrateForThisThreshold)
  TNrates = c(TNrates, TNrateForThisThreshold)
}

plot(1-TNrates, TPrates, type="l")

Remarks:

You can only plot a ROC curve if you have numeric, probability-like predictions (i.e. a number between 0 and 1), even though the thing you want to predict can only be TRUE or FALSE. That is why we need type = "raw" in the prediction line, prediction = predict(classifier, newdata = test.set[,.(x,y)], type = "raw"). This way the predictions are not TRUE or FALSE but numbers between 0 and 1, and the TRUE/FALSE prediction from before is simply numericPrediction >= 0.5, i.e. if the probability exceeds the threshold then the example is predicted as TRUE, and as FALSE otherwise.
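To see this relation concretely, a quick check (assuming the classifier and test.set from above; agreement holds up to exact ties at 0.5):

# type = "class" is just the raw posterior thresholded at 0.5
raw <- predict(classifier, newdata = test.set[,.(x,y)], type = "raw")
cls <- predict(classifier, newdata = test.set[,.(x,y)], type = "class")
all((raw[, 2] >= 0.5) == (cls == levels(cls)[2]))  # should print TRUE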

Who tells us that 0.5 is the correct threshold for our predictor? Couldn't it be 0.7 or 0.1? Correct! We do not know (ad hoc, without more knowledge about the problem) which threshold is the right one. That is why we simply try all of them (above I only tried 0, 0.1, 0.2, ..., 0.9, 1) and compute a confusion matrix for each of these thresholds. This shows how the predictor performs independently of any particular threshold. The more the curve bows towards the perfect classifier (the upper-left corner, i.e. 100% recall at a false positive rate of 0%), the better the predictor performs.

Interpret the axes!!!

Y-Axis means: How many of the actually positive examples did the predictor detect?

X-Axis means: How wastefully does the predictor spend its positive predictions? (i.e. what fraction of the actually negative examples does it wrongly flag as positive?)

That is, you may need a high rate of detected positive examples (for example, when predicting a disease you must be sure that every patient who actually suffers from the disease really is detected; otherwise the whole point of the predictor is defeated). However, simply predicting everybody as TRUE does not help either: the treatment could be harmful, or it is simply costly. Hence we have two opposing players (recall = rate of detected positives, 1-specificity = rate of 'wastefulness' of the predictor), and every point on the ROC curve is one possible predictor. In the end you choose the point on the ROC curve you want, look up the threshold that produced this point, and use that threshold, as in the sketch below.
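For instance, a common (though not universal) rule of thumb is Youden's J statistic, which picks the threshold maximizing TPR + TNR - 1; a sketch using the vectors computed in the loop above:

# Youden's J: height of the ROC point above the diagonal
J <- TPrates + TNrates - 1
best <- which.max(J)
thresholds[best]  # the threshold to use in the end
# mark the chosen operating point on the ROC curve
points(1 - TNrates[best], TPrates[best], col = "red", pch = 19)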


For a ROC curve to exist, you need some threshold or hyperparameter that you can vary.

The numeric output of Bayes classifiers tends to be poorly calibrated (while the binary decision is usually fine), and there is no obvious hyperparameter. You could try treating your prior probability (in a binary problem only!) as the parameter and plot a ROC curve over that.

But in any case, for the curve to exist, you need a map from some curve parameter t to a pair (TPR, FPR). For example, t could be your prior.
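A minimal sketch of that idea for a two-class e1071 naiveBayes model (the objects model, test.x and test.y are hypothetical placeholders; test.y is a factor whose second level is the positive class). Since for binary Naive Bayes the posterior odds are the prior odds times the likelihood ratio, sweeping the prior is equivalent to sweeping a cut-off t on the fixed-prior posterior, and each t yields one (FPR, TPR) point:

library(e1071)
# hypothetical fitted model and held-out data: model, test.x, test.y
post <- predict(model, newdata = test.x, type = "raw")[, 2]  # P(positive)
roc.point <- function(t) {
  pred.pos <- post > t  # t plays the role of the curve parameter
  c(FPR = mean(pred.pos[test.y == levels(test.y)[1]]),
    TPR = mean(pred.pos[test.y == levels(test.y)[2]]))
}
curve.points <- t(sapply(seq(0, 1, by = 0.05), roc.point))
plot(curve.points[, "FPR"], curve.points[, "TPR"], type = "l",
     xlab = "FPR", ylab = "TPR", main = "ROC from a parameter sweep")
abline(a = 0, b = 1, lty = 2)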