I am trying to implement a simple Naive Bayes classifier.
While training I observed two behaviours: if the keywords in the input belong to both classes equally, the classifier assigns equal probability to both classes, and if the keywords are not present in the training data at all, it also assigns equal probability to both classes.
It is difficult to distinguish between these two scenarios. I believe this happens because of Laplace smoothing (add-one), and that the 0.50 in the third case is just the class prior, but I am not sure. Is there a trick to ensure that the classifier assigns a probability of zero when the keywords are not present in the training data? Since my training data is small, is this even possible, or should I look for another option for this scenario?
Output
Fruit probability: 0.50 Veggie probability: 0.50
Fruit probability: 0.50 Veggie probability: 0.50
Fruit probability: 0.50 Veggie probability: 0.50
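My rough arithmetic for why the third case falls back to 0.50 (assuming add-one smoothing; NLTK's default estimator may use a different smoothing constant, but the symmetry argument should be the same): 'Hello' and 'World' are not in allFeatures, so basket_features returns a vector where every contains(...) feature is False. For the fruit class (10 training examples), P(contains(Apple)=False | fruit) = (9+1)/(10+2) = 10/12 and P(contains(Potato)=False | fruit) = (10+1)/(10+2) = 11/12; for the veggie class the roles are swapped, so the products over all 20 features come out identical and the posterior collapses to the prior, 10/20 = 0.50.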
Code
from nltk.classify import NaiveBayesClassifier, accuracy

dataFruits = ['Apple', 'Banana', 'Cherry', 'Grape', 'Guava',
              'Lemon', 'Mangos', 'Orange', 'Strawberry', 'Watermelon']
dataVeggies = ['Potato', 'Spinach', 'Carrot', 'Onion', 'Cabbage',
               'Broccoli', 'Tomato', 'Pea', 'Cucumber', 'Eggplant']

allFeatures = dataFruits + dataVeggies

def basket_features(basket):
    # One boolean feature per known item: contains(<item>).
    # Items outside allFeatures produce no feature at all.
    basket_items = set(basket)
    features = {}
    for item in allFeatures:
        features['contains({})'.format(item)] = (item in basket_items)
    return features

def test(basket):
    feats = basket_features(basket)
    prob_dist = classifier.prob_classify(feats)
    print('\nFruit probability: {:.2f}\tVeggie probability: {:.2f}'.format(
        prob_dist.prob('fruit'), prob_dist.prob('veggie')))

# One training example per item, labelled with its class
class1 = [(basket_features([item]), 'fruit') for item in dataFruits]
class2 = [(basket_features([item]), 'veggie') for item in dataVeggies]
train_set = class1 + class2

# Train
classifier = NaiveBayesClassifier.train(train_set)

# Predict
test(['Apple', 'Banana', 'Potato', 'Spinach'])                       # 2 fruits, 2 veggies
test(['Apple', 'Banana', 'Potato', 'Spinach', 'Strawberry', 'Pea'])  # 3 fruits, 3 veggies
test(['Hello', 'World'])                                             # unknown keywords
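One workaround I am considering (not sure if it is the right approach): since keywords outside allFeatures never turn any contains(...) feature on, I could pre-check the basket myself and report zero before even calling prob_classify. A minimal sketch, where test_checked is my own hypothetical name:

def test_checked(basket):
    # Keep only keywords the classifier actually saw during training.
    known = [item for item in basket if item in allFeatures]
    if not known:
        # Nothing recognisable: report 0 for both classes instead of the priors.
        print('\nFruit probability: 0.00\tVeggie probability: 0.00 (no known keywords)')
        return
    prob_dist = classifier.prob_classify(basket_features(known))
    print('\nFruit probability: {:.2f}\tVeggie probability: {:.2f}'.format(
        prob_dist.prob('fruit'), prob_dist.prob('veggie')))

test_checked(['Hello', 'World'])   # now reports 0.00 / 0.00

This still cannot separate my first two cases (a genuine 50/50 split) from each other, but it does separate "completely unknown input" from "known but ambiguous input". Is this reasonable, or is there a standard way to handle it?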