
There are very complicated examples on the internet, and I couldn't apply them to my code. I have a dataset consisting of 14 independent variables and 1 dependent variable, and I'm doing classification with R. Here is my code:

dataset <- read.table("adult.data", sep = ",", na.strings = c(" ?"))
colnames(dataset) <- c( "age", 
                        "workclass", 
                        "fnlwgt", 
                        "education", 
                        "education.num", 
                        "marital.status", 
                        "occupation", 
                        "relationship", 
                        "race", 
                        "sex", 
                        "capital.gain", 
                        "capital.loss", 
                        "hours.per.week", 
                        "native.country",
                        "is.big.50k")
dataset = na.omit(dataset)

library(caret)
set.seed(1)
training.indices <- createDataPartition(y = dataset$is.big.50k, p = 0.7, list = FALSE)
training.set <- dataset[training.indices,]
test.set <- dataset[-training.indices,]

###################################################################
## Naive Bayes
library(e1071)
classifier = naiveBayes(x = training.set[,-15],
                        y = training.set$is.big.50k)

prediction = predict(classifier, newdata = test.set[,-15])

cm <- confusionMatrix(data = prediction, reference = test.set[,15], 
                      positive = levels(test.set$is.big.50k)[2])

accuracy <- sum(diag(as.matrix(cm))) / sum(as.matrix(cm))

sensitivity <- sensitivity(prediction, test.set[,15], 
                           positive = levels(test.set$is.big.50k)[2])

specificity <- specificity(prediction, test.set[,15], 
                           negative = levels(test.set$is.big.50k)[1])

I also tried the following, and it worked. Is there any mistake? Is there a problem in the transformation step (the as.numeric() calls)?
library(ROCR)
pred <- prediction(as.numeric(prediction), as.numeric(test.set[,15]))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for NB", col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)
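For comparison, here is a sketch of the same ROCR plot built from the posterior probabilities (type = "raw") instead of as.numeric() on the hard class labels (assuming the classifier and test.set from above). Hard labels contribute only a single non-trivial point to the curve, while probabilities provide the full ranking ROCR needs:

library(ROCR)
# posterior probabilities; column 2 is P(positive class)
raw.probs <- predict(classifier, newdata = test.set[,-15], type = "raw")
pred <- prediction(raw.probs[, 2], test.set$is.big.50k)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for NB (posteriors)", col = "blue", lwd = 3)
abline(a = 0, b = 1, lwd = 2, lty = 2)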

Do you know the brms R package? - patL
I'm doing classification, and brms is a regression package. @patL - FK7

2 Answers


Try this:

set.seed(1)
library(data.table)
amount = 100
dataset = data.table(
  x = runif(amount, -1, 1)
  ,y = runif(amount, -1, 1)
)
# inside the circle with radius 0.5? -> true, otherwise false
dataset[, target := sqrt(x^2 + y^2) < 0.5]  # := adds the column by reference
plot(dataset[target == F]$x, dataset[target == F]$y, col="red", xlim = c(-1, 1), ylim = c(-1, 1))
points(dataset[target == T]$x, dataset[target == T]$y, col="green")

library(caret)

training.indices <- createDataPartition(y = dataset$target, p = 0.7, list = FALSE)
# createDataPartition returns a one-column matrix; data.table wants a plain vector for i
training.set <- dataset[as.vector(training.indices),]
test.set <- dataset[-as.vector(training.indices),]

###################################################################
## Naive Bayes
library(e1071)
classifier = naiveBayes(x = training.set[,.(x,y)],
                        y = training.set$target)

prediction = predict(classifier, newdata = test.set[,.(x,y)], type="raw")
prediction = prediction[, 2]
test.set[, prediction := prediction]  # := adds the column by reference

TPrates = c()
TNrates = c()
thresholds = seq(0, 1, by = 0.1)
for (threshold in thresholds) {
  # percentage of correctly classified true examples
  TPrateForThisThreshold = test.set[target == T & prediction > threshold, .N]/test.set[target == T, .N]
  # percentage of correctly classified false examples
  TNrateForThisThreshold = test.set[target == F & prediction <= threshold, .N]/test.set[target == F, .N]

  TPrates = c(TPrates, TPrateForThisThreshold)
  TNrates = c(TNrates, TNrateForThisThreshold)
}

plot(1-TNrates, TPrates, type="l")

Remarks:

You can only plot a ROC curve if you have numeric, probability-like predictions (i.e. a number between 0 and 1), even though the thing you want to predict can only be TRUE or FALSE. That is why we need type = "raw" in the prediction line, prediction = predict(classifier, newdata = test.set[,.(x,y)], type = "raw"). This way the predictions are not TRUE or FALSE but numbers between 0 and 1, and the TRUE/FALSE prediction from before is simply numericPrediction >= 0.5, i.e. if the probability exceeds the threshold then the example is predicted as TRUE, and as FALSE otherwise.
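To see this relation concretely, a quick check (assuming the classifier and test.set from above; agreement holds up to exact ties at 0.5):

# type = "class" is just the raw posterior thresholded at 0.5
raw <- predict(classifier, newdata = test.set[,.(x,y)], type = "raw")
cls <- predict(classifier, newdata = test.set[,.(x,y)], type = "class")
all((raw[, 2] >= 0.5) == (cls == levels(cls)[2]))  # should print TRUE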

Who tells us that 0.5 is the correct threshold for our predictor? Couldn't it be 0.7 or 0.1? Correct! We do not know (ad hoc, without more knowledge about the problem) which threshold is the right one. That is why we simply try all of them (above I only tried 0, 0.1, 0.2, ..., 0.9, 1) and compute a confusion matrix for each of these thresholds. This shows how the predictor performs independently of any particular threshold. The more the curve bows towards the perfect classifier (the upper-left corner, i.e. 100% recall at a false positive rate of 0%), the better the predictor performs.

Interpret the axes!!!

Y-Axis means: How many of the actually positive examples did the predictor detect?

X-Axis means: How wastefully does the predictor spend its positive predictions? (i.e. what fraction of the actually negative examples does it wrongly flag as positive?)

That is, you may need a high rate of detected positive examples (for example, when predicting a disease you must be sure that every patient who actually suffers from the disease really is detected; otherwise the whole point of the predictor is defeated). However, simply predicting everybody as TRUE does not help either: the treatment could be harmful, or it is simply costly. Hence we have two opposing players (recall = rate of detected positives, 1-specificity = rate of 'wastefulness' of the predictor), and every point on the ROC curve is one possible predictor. In the end you choose the point on the ROC curve you want, look up the threshold that produced this point, and use that threshold, as in the sketch below.
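For instance, a common (though not universal) rule of thumb is Youden's J statistic, which picks the threshold maximizing TPR + TNR - 1; a sketch using the vectors computed in the loop above:

# Youden's J: height of the ROC point above the diagonal
J <- TPrates + TNrates - 1
best <- which.max(J)
thresholds[best]  # the threshold to use in the end
# mark the chosen operating point on the ROC curve
points(1 - TNrates[best], TPrates[best], col = "red", pch = 19)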


For a ROC curve to exist, you need some threshold or hyperparameter that you can vary.

The numeric output of Bayes classifiers tends to be poorly calibrated (while the binary decision is usually fine), and there is no obvious hyperparameter. You could try treating your prior probability (in a binary problem only!) as the parameter and plot a ROC curve over that.

But in any case, for the curve to exist, you need a map from some curve parameter t to a pair (TPR, FPR). For example, t could be your prior.
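A minimal sketch of that idea for a two-class e1071 naiveBayes model (the objects model, test.x and test.y are hypothetical placeholders; test.y is a factor whose second level is the positive class). Since for binary Naive Bayes the posterior odds are the prior odds times the likelihood ratio, sweeping the prior is equivalent to sweeping a cut-off t on the fixed-prior posterior, and each t yields one (FPR, TPR) point:

library(e1071)
# hypothetical fitted model and held-out data: model, test.x, test.y
post <- predict(model, newdata = test.x, type = "raw")[, 2]  # P(positive)
roc.point <- function(t) {
  pred.pos <- post > t  # t plays the role of the curve parameter
  c(FPR = mean(pred.pos[test.y == levels(test.y)[1]]),
    TPR = mean(pred.pos[test.y == levels(test.y)[2]]))
}
curve.points <- t(sapply(seq(0, 1, by = 0.05), roc.point))
plot(curve.points[, "FPR"], curve.points[, "TPR"], type = "l",
     xlab = "FPR", ylab = "TPR", main = "ROC from a parameter sweep")
abline(a = 0, b = 1, lty = 2)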