I have been implementing a Multinomial Naive Bayes classifier from scratch for text classification in Python. I compute the feature counts for each class and the probability distributions over the features. With my implementation I get the following results.
Suppose I have the following corpus:
corpus = [
{'text': 'what is chat service?', 'category': 'what_is_chat_service'},
{'text': 'Why should I use your chat service?', 'category': 'why_use_chat_service'}
]
According to Naive Bayes, the prior probability of each class for this corpus is 0.5.
If I do some preprocessing (lowercasing, stop-word removal, and punctuation removal), I get the following token lists (a minimal sketch of this step is shown after the list):
- text 1: [chat, service]
- text 2: [use, chat, service]
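To make the preprocessing step concrete, here is a minimal sketch (the preprocess helper and the STOP_WORDS set are illustrative for this question, not my exact code):

import string

# Illustrative stop-word list (an assumption for this example; my real list differs).
STOP_WORDS = {'what', 'is', 'why', 'should', 'i', 'your'}

def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation, split on whitespace, optionally drop stop words."""
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess('what is chat service?'))                # ['chat', 'service']
print(preprocess('Why should I use your chat service?'))  # ['use', 'chat', 'service']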
Now, if I want to predict the class of the text "what is chat service", then after preprocessing the Naive Bayes rule gives the following probabilities:
class                   chat    service    P(class|features)
what_is_chat_service    1       1          0.5
why_use_chat_service    1       1          0.5
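To make the tie concrete, here is a simplified sketch of the training and prediction steps, reusing the preprocess helper and the corpus above. It assumes the per-feature probability is the fraction of documents in a class that contain the token (which reproduces the 1s in the table); the function names are only for this question, not my exact code:

from collections import defaultdict

DEFAULT_PROB = 1e-9  # fallback for a feature never seen in a class (used further below)

def train(corpus, remove_stop_words=True):
    """Return class priors and per-class feature probabilities (document frequencies)."""
    doc_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for doc in corpus:
        cls = doc['category']
        doc_counts[cls] += 1
        for token in set(preprocess(doc['text'], remove_stop_words)):
            feature_counts[cls][token] += 1
    total_docs = sum(doc_counts.values())
    priors = {c: n / total_docs for c, n in doc_counts.items()}
    likelihoods = {c: {t: n / doc_counts[c] for t, n in feats.items()}
                   for c, feats in feature_counts.items()}
    return priors, likelihoods

def predict(text, priors, likelihoods, remove_stop_words=True):
    """Score each class as prior * product of feature probabilities, then normalize."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for token in preprocess(text, remove_stop_words):
            score *= likelihoods[cls].get(token, DEFAULT_PROB)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

priors, likelihoods = train(corpus)
print(predict('what is chat service', priors, likelihoods))
# {'what_is_chat_service': 0.5, 'why_use_chat_service': 0.5}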
I get equal probabilities for the two classes, and I have been looking for a way to improve this situation. One possible way is to include the stop words. If we include the stop words, we get the following feature probabilities:
class                   what    is      chat    service    P(class|features)
what_is_chat_service    1       1       1       1          0.5 (higher)
why_use_chat_service    1e-9    1e-9    1       1          5e-19
Here I assume a default probability of 1e-9 for any feature that does not occur in a class. In that case we get the higher probability for class 1, what_is_chat_service.
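Continuing the sketch, including the stop words just means skipping the stop-word filter at both training and prediction time; the 1e-9 default covers any feature a class has never seen:

priors, likelihoods = train(corpus, remove_stop_words=False)
print(predict('what is chat service', priors, likelihoods, remove_stop_words=False))
# why_use_chat_service pays the 1e-9 penalty twice (for 'what' and 'is'),
# so the raw scores are 0.5 vs 5e-19, and after normalizing
# what_is_chat_service comes out at ~1.0.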
However, I still get equal probabilities, even after including the stop words, if the corpus is as follows:
corpus = [
{'text': 'what is chat service?', 'category': 'what_is_chat_service'},
{'text': 'what is the benefit of using chat service?', 'category': 'why_use_chat_service'}
]
In that case, all the feature probabilities are 1 for both classes, so the probabilities for predicting the text "what is chat service?" are again equal.
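With the same sketch, this second corpus indeed gives a tie, because every token of the test text occurs in both classes:

priors, likelihoods = train(corpus, remove_stop_words=False)  # corpus = the second corpus above
print(predict('what is chat service?', priors, likelihoods, remove_stop_words=False))
# {'what_is_chat_service': 0.5, 'why_use_chat_service': 0.5}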
But I need the 'what_is_chat_service' class to be predicted. How can I get the desired class predicted? I have also tried the Naive Bayes classifier from sklearn, but it did not give the desired result.
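For reference, this is roughly the sklearn setup I tried (a sketch only; the exact vectorizer options in my attempt may have differed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [doc['text'] for doc in corpus]
labels = [doc['category'] for doc in corpus]

vectorizer = CountVectorizer()          # options here are an assumption, not my exact settings
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform(['what is chat service?'])
print(clf.predict(test), clf.predict_proba(test))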
If my question is verbose or unclear, or if more information is required, please let me know. Thanks in advance.