1 vote

For a large text classification problem, I used various classifiers including LDA, Random Forest, kNN, etc. and got accuracy rates of 78-85%. However, Multinomial Naive Bayes using bnlearn gave an accuracy of 97%. I investigated why the accuracy was so high, and the issue appears to be with the prediction step in bnlearn; perhaps I have used the wrong parameters.

I can illustrate the issue using a sample data set.

Long    Sweet   Yellow  Fruit
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
No  Yes Yes Banana
No  Yes Yes Orange
No  Yes Yes Orange
No  Yes Yes Orange
Yes Yes Yes Other
No  Yes No  Other
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes No  Yes Banana
Yes No  No  Banana
No  No  Yes Banana
No  No  Yes Orange
No  No  Yes Orange
No  No  Yes Orange
Yes Yes No  Other
No  No  No  Other
Yes Yes Yes Banana
No  Yes Yes Banana
No  Yes Yes Orange
No  Yes Yes Orange
No  Yes No  Other

The above is a data set of 25 rows, loaded as a data frame bn.X.
It can be split into a 20-row training set and a 5-row test set.

Step 1: Loading the data

Y = bn.X[, 4]  # outcome column
train = 1:20
cols = 1:4
bn.X[cols] <- lapply(bn.X[cols], as.factor)  # convert all columns to factors
trainbn.X = bn.X[train, ]
testbn.X = bn.X[-train, ]
trainbn.Y = Y[train]
testbn.Y = Y[-train]
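
bnlearn's discrete network functions expect every variable to be a factor, so a quick sanity check after the conversion can save confusion later (a minimal sketch using the objects above):

str(trainbn.X)                # all four columns should be factors
sapply(trainbn.X, is.factor)  # should be TRUE for every column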

Step 2: Classification using bnlearn

library(bnlearn)

NB.fit = naive.bayes(trainbn.X, "Fruit")

# Prediction
NB.pred = predict(NB.fit, testbn.X, prob = TRUE)
writeLines("\n Multinomial Naive Bayes\n")
table(NB.pred, testbn.Y)
cat("Accuracy %:", mean(NB.pred == testbn.Y) * 100)

Step 3: Classification using LDA

library(MASS)

lda.fit = lda(Fruit ~ ., data = trainbn.X)

# Prediction
lda.pred = predict(lda.fit, testbn.X)
lda.class = lda.pred$class
writeLines("\n LDA \n")
table(lda.class, testbn.Y)
cat("Accuracy %:", mean(lda.class == testbn.Y) * 100)

Both the bnlearn Naive Bayes model and LDA give the same predictions, with 80% accuracy on the 5 test rows.

However, bnlearn seems to be using the outcome values of the test rows as well for prediction. This seems to be the reason I got a high accuracy value for the text classification scenario I was working on.

If I do any one of the following to the test data before prediction:

testbn.X$Fruit=NA
testbn.X$Fruit="Orange"
testbn.X[1:3,]$Fruit="Orange"

There is no impact on the results of LDA: it completely ignores any outcome values supplied in the test data. This is the expected behaviour.
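
For example, a quick comparison along these lines (a sketch reusing the objects above) shows that the LDA predictions are unaffected:

# LDA predictions should not depend on the Fruit column of the test data
testbn.X2 = testbn.X
testbn.X2$Fruit = "Orange"             # overwrite the outcome column
lda.pred2 = predict(lda.fit, testbn.X2)
identical(lda.pred2$class, lda.class)  # expected TRUE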

However, for bnlearn this is not the case. During prediction, an error is raised when Fruit is set to NA or when all values are set to "Orange", and for the third manipulation the bnlearn prediction returns a completely different result.

Question: Is the way I have used bnlearn's predict function correct? Should I be using different parameters?


1 Answer

4 votes

It seems that the function naive.bayes doesn't actually fit a network; it only defines the structure of a naive Bayes network based on the supplied data. If you wish to perform out-of-sample prediction, you first need to estimate the network parameters by calling bn.fit on a training set.

The confusion here arises because the predict method accepts either a network structure or a fitted network as its object argument. If the object is only a network structure (such as the one returned by naive.bayes), the network parameters are estimated from the data supplied to predict. As a consequence, the predictions obtained in your example are actually in-sample predictions on the test data. From ?naive.bayes, under the Note section:

predict accepts either a bn or a bn.fit object as its first argument. For the former, the parameters of the network are fitted on data, that is, the observations whose class labels the function is trying to predict.

If, however, a fitted network is supplied as the object, its fitted parameters are used for the prediction, and the values of the training variable in the data supplied to predict don't affect the predictions. (For some reason, NA values still don't seem to be allowed for the training variable even with a fitted network.) You can obtain the fitted network by calling bn.fit with your network structure (a bn object) and your training data, i.e. bn.fit(bn, training_data).

I haven't used the bnlearn package before, but these are the conclusions I reached from testing and reading the documentation. Here's some code that I used to test the behaviour, based on your work:

# training and testing data
set.seed(1)  # bnlearn uses stochastic tie-breaking
train_idx <- 1:20

train_fruit <- fruit[train_idx, ]
test_fruit <- fruit[-train_idx, ]

library(bnlearn)

nb.net <- naive.bayes(train_fruit, "Fruit")  # network structure
nb.fit <- bn.fit(nb.net, train_fruit)  # fit the network
nb.pred <- predict(nb.fit, test_fruit)  # out-of-sample prediction

mean(nb.pred == test_fruit$Fruit)
#    [1] 0.8

# manipulated test data
test_fruit2 <- test_fruit
test_fruit2[1:3, "Fruit"] <- "Orange"

# fitted network as predict object
nb.pred2_fit <- predict(nb.fit, test_fruit2)
identical(nb.pred2_fit, nb.pred)
#    [1] TRUE

# network structure as predict object
nb.pred2_net <- predict(nb.net, test_fruit2)
#     Warning messages:
#     1: In check.data(data, allowed.types = discrete.data.types) :
#       variable Sweet has levels that are not observed in the data.
#     2: In check.data(data, allowed.types = discrete.data.types) :
#       variable Fruit has levels that are not observed in the data.
identical(nb.pred2_net, nb.pred)
#    [1] FALSE
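
If you want to confirm that the fitted parameters come from the training data alone, you can also print the conditional probability tables stored in the fitted network (a small sketch; nb.fit is the object created above):

# parameters estimated by bn.fit from train_fruit only
nb.fit$Fruit   # distribution of the class node
nb.fit$Yellow  # conditional probability table of Yellow given Fruit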

Here is the fruit data set I used in my sample code, read from your post:

fruit <- read.table(header = TRUE, text = "
Long    Sweet   Yellow  Fruit
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes Yes Yes Banana
No  Yes Yes Banana
No  Yes Yes Orange
No  Yes Yes Orange
No  Yes Yes Orange
Yes Yes Yes Other
No  Yes No  Other
Yes Yes Yes Banana
Yes Yes Yes Banana
Yes No  Yes Banana
Yes No  No  Banana
No  No  Yes Banana
No  No  Yes Orange
No  No  Yes Orange
No  No  Yes Orange
Yes Yes No  Other
No  No  No  Other
Yes Yes Yes Banana
No  Yes Yes Banana
No  Yes Yes Orange
No  Yes Yes Orange
No  Yes No  Other
")