I have been implementing a Multinomial Naive Bayes classifier from scratch for text classification in Python.

I calculate the feature counts for each class and the probability distributions over the features.

With my implementation I get the following results.

Suppose I have the following corpus:

corpus = [
            {'text': 'what is chat service?', 'category': 'what_is_chat_service'},
            {'text': 'Why should I use your chat service?', 'category': 'why_use_chat_service'}
        ]

According to Naive Bayes, the prior probability for each class in this corpus will be 0.5.

If I do some preprocessing, including converting to lowercase and removing stop words and punctuation, I get the following token lists:

  • text 1: [chat, service]
  • text 2: [use, chat, service]
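
In code, that preprocessing is roughly the following (a minimal sketch; the stop word list here is just a small hand-picked set for this example, not a standard list):

    import string

    # Hand-picked stop word set for this example (an assumption,
    # not a standard stop word list).
    STOP_WORDS = {'what', 'is', 'why', 'should', 'i', 'your'}

    def preprocess(text):
        # Lowercase, strip punctuation, then drop stop words.
        text = text.lower().translate(str.maketrans('', '', string.punctuation))
        return [tok for tok in text.split() if tok not in STOP_WORDS]

    print(preprocess('what is chat service?'))                # ['chat', 'service']
    print(preprocess('Why should I use your chat service?'))  # ['use', 'chat', 'service']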

Now, if I want to predict the class for the text "what is chat service", after preprocessing the Naive Bayes rule gives the following probabilities:

class                    chat    service    P(class|features)
what_is_chat_service     1       1          0.5
why_use_chat_service     1       1          0.5
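
The scoring behind this table works out as follows (a minimal sketch; the feature probabilities are taken straight from the per-class token counts above):

    priors = {'what_is_chat_service': 0.5, 'why_use_chat_service': 0.5}
    feature_probs = {
        'what_is_chat_service': {'chat': 1.0, 'service': 1.0},
        'why_use_chat_service': {'use': 1.0, 'chat': 1.0, 'service': 1.0},
    }

    def score(tokens, cls):
        # The posterior is proportional to the prior times the
        # product of the per-token likelihoods.
        p = priors[cls]
        for tok in tokens:
            p *= feature_probs[cls][tok]
        return p

    for cls in priors:
        print(cls, score(['chat', 'service'], cls))  # 0.5 for both classes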

I get equal probabilities for the two classes. I have been looking into ways to improve this situation.

One possible way is to keep the stop words. If we include them, we get the following feature probabilities:

class                    what    is      chat    service    P(class|features)
what_is_chat_service     1       1       1       1          0.5 (higher)
why_use_chat_service     1e-9    1e-9    1       1          5e-19

Here I assume a default probability of 1e-9 for any feature that does not appear in a class.

In that case we get a higher probability for class 1: what_is_chat_service.
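
The same scoring sketch with stop words kept, applying the default probability to tokens a class has never seen:

    feature_probs = {
        'what_is_chat_service': {'what': 1.0, 'is': 1.0, 'chat': 1.0, 'service': 1.0},
        'why_use_chat_service': {'why': 1.0, 'should': 1.0, 'i': 1.0, 'use': 1.0,
                                 'your': 1.0, 'chat': 1.0, 'service': 1.0},
    }

    def score(tokens, cls, default=1e-9):
        p = 0.5  # equal priors
        for tok in tokens:
            # Tokens a class has never seen fall back to the default.
            p *= feature_probs[cls].get(tok, default)
        return p

    tokens = ['what', 'is', 'chat', 'service']
    print(score(tokens, 'what_is_chat_service'))  # 0.5
    print(score(tokens, 'why_use_chat_service'))  # 5e-19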

However, I still get equal probabilities after including stop words if the corpus is as follows:

corpus = [
            {'text': 'what is chat service?', 'category': 'what_is_chat_service'},
            {'text': 'what is the benefit of using chat service?', 'category': 'why_use_chat_service'}
        ]

In that case all the feature probabilities will be 1 for both classes.

And the predicted probabilities for the text "what is chat service?" will again be equal.

But I need the 'what_is_chat_service' class to be predicted.

How can I get the desired class predicted? I have tried the Naive Bayes classifiers from sklearn but did not get the desired result.
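
For reference, a minimal sketch of the sklearn version (using the second corpus; note that MultinomialNB applies Laplace smoothing with alpha=1.0 by default, which plays a role similar to the default probability above, so its output may differ from an unsmoothed from-scratch implementation):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ['what is chat service?',
             'what is the benefit of using chat service?']
    labels = ['what_is_chat_service', 'why_use_chat_service']

    # CountVectorizer keeps stop words by default.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    model = MultinomialNB().fit(X, labels)  # alpha=1.0 (Laplace smoothing)

    query = vectorizer.transform(['what is chat service?'])
    print(model.predict(query))
    print(model.predict_proba(query))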

If my question is verbose or unclear, or if more information is required, please let me know.

Thanks in advance.


1 Answer

Naive Bayes does not take word order into account, so it is good at categorizing the main topics of a document (normally not just a sentence but a complete document: many paragraphs, like a news article for instance).

In your examples, the topic is really "chat service" (or maybe "web service" or "customer service").

But "why a chat service" vs. "what is a chat service" is not really something that can be easily separated by a text classifier, since the difference between why and what is mostly syntactic. For instance, consider the following sentences:

  1. what is a chat service (you want the what category)
  2. what is a chat service for (you want the why category)

Only an accurate syntactic analysis of the sentences would help here (and that task is very hard). Any approach based on the bag-of-words (or vector space) model used by almost all text classifiers will likely fail at this task.

Now, I know my answer does not help much, but it is what it is. If you want slightly better classification while still using a Naive Bayes classifier, you can try adding n-grams (sequences of words) to your features. That would capture some syntactic information (sometimes, not always).
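
For instance, a short sketch (assuming sklearn's CountVectorizer) of what bigram features look like for the two example sentences above:

    from sklearn.feature_extraction.text import CountVectorizer

    # ngram_range=(1, 2) keeps the unigrams and adds word bigrams,
    # so sequences like "what is" and "service for" become features.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    vectorizer.fit(['what is a chat service',
                    'what is a chat service for'])
    print(vectorizer.get_feature_names_out())

Note that the bigram "service for" appears only in the second sentence, which gives the classifier something concrete to separate the two categories with.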