For a text classification problem on a large dataset, I tried several classifiers, including LDA, RandomForest and kNN, and got accuracy rates of 78-85%. However, Multinomial Naive Bayes using bnlearn gave an accuracy of 97%. I investigated why the accuracy was so high, and the issue appears to be with prediction in bnlearn; maybe I have used the wrong parameters.
To illustrate, here is a sample data set:
Long Sweet Yellow Fruit
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
No Yes Yes Banana
No Yes Yes Orange
No Yes Yes Orange
No Yes Yes Orange
Yes Yes Yes Other
No Yes No Other
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes No Yes Banana
Yes No No Banana
No No Yes Banana
No No Yes Orange
No No Yes Orange
No No Yes Orange
Yes Yes No Other
No No No Other
Yes Yes Yes Banana
No Yes Yes Banana
No Yes Yes Orange
No Yes Yes Orange
No Yes No Other
The above is a dataset of 25 rows, loaded as a data frame bn.X.
It can be split into a 20-row training set and a 5-row test set.
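For reproducibility, here is a minimal sketch of how bn.X could be built from the table above. Using read.table with text= and stringsAsFactors = TRUE is an assumption made for illustration; the data may equally be read from a file.

```r
# Sketch: build bn.X from the sample table (the loading method is an assumption)
bn.X <- read.table(header = TRUE, stringsAsFactors = TRUE, text = "
Long Sweet Yellow Fruit
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
No Yes Yes Banana
No Yes Yes Orange
No Yes Yes Orange
No Yes Yes Orange
Yes Yes Yes Other
No Yes No Other
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes No Yes Banana
Yes No No Banana
No No Yes Banana
No No Yes Orange
No No Yes Orange
No No Yes Orange
Yes Yes No Other
No No No Other
Yes Yes Yes Banana
No Yes Yes Banana
No Yes Yes Orange
No Yes Yes Orange
No Yes No Other
")

# Quick look at the structure: 25 observations of 4 factor columns
str(bn.X)
```

With stringsAsFactors = TRUE the columns are already factors, so the conversion in Step 1 becomes a harmless no-op.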
Step 1: Loading the data
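train=1:20
cols=1:4
# Convert every column to a factor (bnlearn expects factors for discrete data).
# lapply keeps the factor class; apply() would return a character matrix instead.
bn.X[cols] <- lapply(bn.X[cols], as.factor)
Y=bn.X[,4] # Outcome column, taken after the conversion so it is a factor too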
trainbn.X=bn.X[train,]
testbn.X=bn.X[-train,]
trainbn.Y=Y[train]
testbn.Y=Y[-train]
Step 2: Classification using bnlearn
library(bnlearn)
NB.fit = naive.bayes(trainbn.X, "Fruit")
# Prediction
NB.pred=predict(NB.fit,testbn.X,prob=TRUE)
writeLines("\n Multinomial Naive Bayes\n")
table(NB.pred, testbn.Y)
cat("Accuracy %:", mean(NB.pred == testbn.Y )*100)
Step 3: Classification using LDA
library(MASS)
lda.fit=lda(Fruit~.,data=trainbn.X)
# Prediction
lda.pred=predict(lda.fit,testbn.X)
lda.class=lda.pred$class
writeLines("\n LDA \n")
table(lda.class,testbn.Y)
cat("Accuracy %:", mean(lda.class == testbn.Y )*100)
Both bnlearn Naive Bayes and LDA give the same prediction with 80% accuracy for the 5 rows.
However, bnlearn seems to be using the outcome values of the test rows as well for prediction. This seems to be the reason I got a high accuracy value for the text classification scenario I was working on.
If I do any one of the following before prediction,
testbn.X$Fruit=NA
testbn.X$Fruit="Orange"
testbn.X[1:3,]$Fruit="Orange"
There is no impact on the results of LDA - LDA completely ignores the outcome values of the test data if provided. This is ideal behavior.
For bnlearn, however, this is not the case. Prediction fails with an error in the first two cases (Fruit set to NA, and Fruit set to "Orange" for all rows). In the third case (Fruit set to "Orange" for the first 3 test rows only), bnlearn returns a completely different prediction.
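One thing that may be relevant to the errors: in base R, the three assignments do not leave the Fruit column in the same state. The sketch below rebuilds the five test rows by hand (the names test, t1, t2, t3 and lv are illustrative, not from my original code) and checks the column class after each assignment. Whether this is what triggers the bnlearn errors is an assumption I have not verified.

```r
# Test rows 21-25 of the sample data, rebuilt by hand so the snippet stands alone
lv <- c("Banana", "Orange", "Other")
yn <- c("No", "Yes")
test <- data.frame(
  Long   = factor(c("Yes", "No", "No", "No", "No"), levels = yn),
  Sweet  = factor(c("Yes", "Yes", "Yes", "Yes", "Yes"), levels = yn),
  Yellow = factor(c("Yes", "Yes", "Yes", "Yes", "No"), levels = yn),
  Fruit  = factor(c("Banana", "Banana", "Orange", "Orange", "Other"), levels = lv)
)

t1 <- test; t1$Fruit <- NA               # whole column replaced by NA
t2 <- test; t2$Fruit <- "Orange"         # whole column replaced by a string
t3 <- test; t3[1:3, ]$Fruit <- "Orange"  # only the first three values replaced

class(t1$Fruit)  # "logical"
class(t2$Fruit)  # "character"
class(t3$Fruit)  # "factor"
```

So the first two cases hand predict a Fruit column that is no longer a factor, while the third keeps a factor with changed values. That might account for the errors, but it does not explain why the predictions change in the third case.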
Question: Am I using the predict function of bnlearn correctly? Should I be passing different parameters?