Balanced corpus for Naive Bayes Classifier

Question

I'm working on sentiment analysis using a Naive Bayes (NB) classifier. Several sources (blogs, tutorials, etc.) say that the training corpus should be balanced:

33.3% Positive
33.3% Neutral
33.3% Negative

My question is:

Why should the corpus be balanced? Bayes' theorem is based on the probability of a reason/case. So for training purposes, isn't it important that in the real world, for example, negative tweets are only 10% and not 33.3%?

lejlot · Accepted Answer · 2017-07-02T20:48:47

You are correct: balancing data is important for many discriminative models, but not really for NB.
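To see why, it helps to write out the standard NB decision rule (here x_i denotes the i-th feature, e.g. a word, and y a class; this notation is mine, not from the question):

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i} P(x_i \mid y)
```

The class proportions of the training corpus enter only through the prior P(y). Each likelihood P(x_i | y) is estimated from the documents of class y alone, so an unbalanced corpus changes how much data each class has, but not what those likelihoods are trying to estimate.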

However, it can still be beneficial to bias the P(y) estimates to get better predictive performance, because the various simplifications these models make can cause the probability assigned to the minority class to be heavily underestimated. For NB this is not about balancing the data, but about directly modifying the estimated P(y) so that accuracy on the validation set is maximised.
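As a rough illustration of that idea, here is a minimal sketch of a two-class multinomial NB written from scratch, where the prior is first estimated from class frequencies and then replaced by whichever value from a small grid gives the best validation accuracy. The toy documents, the grid, and all function names (`train_nb`, `predict`, `accuracy`, `tune_prior`) are made up for the example; a real setup would use a proper corpus and a held-out validation split.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial NB with add-alpha smoothing; return (log_prior, log_lik)."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    # Prior estimated from class frequencies in the (unbalanced) training data.
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict(doc, log_prior, log_lik):
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


def accuracy(docs, labels, log_prior, log_lik):
    hits = sum(predict(d, log_prior, log_lik) == y for d, y in zip(docs, labels))
    return hits / len(labels)


def tune_prior(val_docs, val_labels, log_prior, log_lik, minority, grid):
    """Two-class case only: pick P(minority) from grid by validation accuracy."""
    (majority,) = [c for c in log_prior if c != minority]
    best_p = math.exp(log_prior[minority])
    best_acc = accuracy(val_docs, val_labels, log_prior, log_lik)
    for p in grid:
        candidate = {minority: math.log(p), majority: math.log(1.0 - p)}
        acc = accuracy(val_docs, val_labels, candidate, log_lik)
        if acc > best_acc:  # keep the estimated prior unless a candidate is strictly better
            best_p, best_acc = p, acc
    return best_p, best_acc


# Hypothetical toy data: 9 positive documents, 1 negative (10% minority class).
train_docs = [["good", "film"]] * 8 + [["ok", "film"]] + [["bad", "film"]]
train_labels = ["pos"] * 9 + ["neg"]
val_docs = [["bad", "ok"], ["good"], ["ok", "film"]]
val_labels = ["neg", "pos", "pos"]

log_prior, log_lik = train_nb(train_docs, train_labels)
base_acc = accuracy(val_docs, val_labels, log_prior, log_lik)
best_p, tuned_acc = tune_prior(
    val_docs, val_labels, log_prior, log_lik,
    minority="neg", grid=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5],
)
print("estimated P(neg):", math.exp(log_prior["neg"]))
print("validation accuracy with estimated prior:", base_acc)
print("tuned P(neg):", best_p, "validation accuracy:", tuned_acc)
```

Note that the training data is left untouched: only the P(y) term changes, while the per-class word likelihoods stay exactly as estimated. If I recall correctly, scikit-learn's `MultinomialNB` exposes the same knob through its `class_prior` parameter.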
