
I am trying to implement a simple Naive Bayes classifier. During training I observed that if the keywords (features) belong to both classes equally, the classifier assigns equal probability to both classes, and if the keywords are not present in the training data at all, it also assigns equal probability to both classes.

It is difficult to distinguish between these two scenarios. I believe this happens because of Laplace smoothing with a constant of 1, and that the 0.5 probability in case 3 (unknown keywords) comes from the class prior, but I am not sure (my back-of-the-envelope reasoning is sketched after the code below). Is there a trick to make the classifier assign a probability of zero when none of the keywords are present in the training data? Since the training data is small, is this possible, or should I look for another approach for this scenario?

Output

Fruit probability: 0.50 Veggie probability: 0.50

Fruit probability: 0.50 Veggie probability: 0.50

Fruit probability: 0.50 Veggie probability: 0.50

Code

from nltk.classify import NaiveBayesClassifier

dataFruits = ['Apple', 'Banana', 'Cherry', 'Grape', 'Guava',
              'Lemon', 'Mangos', 'Orange', 'Strawberry', 'Watermelon']

dataVeggies = ['Potato', 'Spinach', 'Carrot', 'Onion', 'Cabbage',
               'Broccoli', 'Tomato', 'Pea', 'Cucumber', 'Eggplant']

allFeatures = dataFruits + dataVeggies

def basket_features(basket):
    # Encode a basket as boolean contains(...) features over the full vocabulary.
    basket_items = set(basket)
    features = {}
    for item in allFeatures:
        features['contains({})'.format(item)] = (item in basket_items)
    return features

def test(basket):
    featureset = basket_features(basket)
    prob_dist = classifier.prob_classify(featureset)
    print('\nFruit probability: {:.2f}\tVeggie probability: {:.2f}'.format(
        prob_dist.prob('fruit'), prob_dist.prob('veggie')))

# One training example per item, labelled with its class.
class1 = [(basket_features([item]), 'fruit') for item in dataFruits]
class2 = [(basket_features([item]), 'veggie') for item in dataVeggies]

train_set = class1 + class2

# Train
classifier = NaiveBayesClassifier.train(train_set)

# Predict
test(['Apple', 'Banana', 'Potato', 'Spinach'])
test(['Apple', 'Banana', 'Potato', 'Spinach', 'Strawberry', 'Pea'])
test(['Hello', 'World'])
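
Here is the back-of-the-envelope calculation behind my assumption above. The add-one smoothing over raw item counts is my simplification (NLTK actually smooths the boolean contains(...) features), but it shows why the unseen-keyword case collapses to 0.5/0.5 when the classes are balanced:

# Assumed add-one (Laplace) smoothing over raw item counts; not NLTK's internals.
prior_fruit = prior_veggie = 10 / 20          # 10 fruits, 10 veggies -> prior 0.5 each

vocab_size = 20
# A keyword never seen in either class gets the same small smoothed likelihood.
likelihood_fruit = (0 + 1) / (10 + vocab_size)
likelihood_veggie = (0 + 1) / (10 + vocab_size)

score_fruit = prior_fruit * likelihood_fruit
score_veggie = prior_veggie * likelihood_veggie

# Normalising gives 0.5 / 0.5 -- indistinguishable from keywords that occur
# equally often in both classes.
total = score_fruit + score_veggie
print(score_fruit / total, score_veggie / total)   # 0.5 0.5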

1 Answer


It sounds like the Naive Bayes classifier is doing the right thing, namely estimating the conditional probability distribution over classes given some input features. If no input features match your training data (your case 3), then it is correct that the output conditional distribution is flat. In your setup, that means case 3 (no usable input features) is equivalent to case 1 (input features that are present but do not discriminate at all between fruit and vegetables).

If you want to distinguish the two cases, it may help to look at the prior probability of your input features: this would show that case 1 looks much more like your training data than case 3, which presumably has no features in common with your training data. Depending on how your NBC is constructed, that prior probability might be strictly zero, or assigned some small value to avoid the risk of taking the log of zero.
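
As a concrete illustration of that idea, here is a minimal sketch reusing the question's classifier, basket_features and allFeatures. The zero-probability fallback and the test_with_unknown_check name are my own additions, not part of NLTK: the sketch simply checks whether a basket shares any keywords with the training vocabulary before classifying.

def test_with_unknown_check(basket):
    # Keywords that actually occurred in the training data.
    known = set(basket) & set(allFeatures)
    if not known:
        # Nothing usable: report zero for both classes instead of the flat prior.
        print('Fruit probability: 0.00\tVeggie probability: 0.00 (no known keywords)')
        return
    prob_dist = classifier.prob_classify(basket_features(basket))
    print('Fruit probability: {:.2f}\tVeggie probability: {:.2f}'.format(
        prob_dist.prob('fruit'), prob_dist.prob('veggie')))

test_with_unknown_check(['Apple', 'Banana', 'Potato', 'Spinach'])  # 0.50 / 0.50
test_with_unknown_check(['Hello', 'World'])                        # flagged: no known keywords

This keeps case 1 at 0.50/0.50, where the features are known but uninformative, while case 3 is flagged explicitly instead of silently falling back to the class prior.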