14
votes

I am comparing two Naive Bayes classifiers: one from NLTK and one from scikit-learn. I'm dealing with a multi-class classification problem (3 classes: positive (1), negative (-1), and neutral (0)).

Without performing any feature selection (that is, using all available features), and using a training set of 70,000 instances (noisily labeled, with a class distribution of 17% positive, 4% negative and 78% neutral), I train two classifiers: the first is an nltk.NaiveBayesClassifier, and the second is a sklearn.naive_bayes.MultinomialNB (with fit_prior=True).

After training, I evaluated the classifiers on my test set of 30,000 instances and got the following results:

**NLTK's NaiveBayes**
accuracy: 0.568740
class: 1
     precision: 0.331229
     recall: 0.331565
     F-Measure: 0.331355
class: -1
     precision: 0.079253 
     recall: 0.446331 
     F-Measure: 0.134596 
class: 0
     precision: 0.849842 
     recall: 0.628126 
     F-Measure: 0.722347 


**Scikit's MultinomialNB (with fit_prior=True)**
accuracy: 0.834670
class: 1
     precision: 0.400247
     recall: 0.125359
     F-Measure: 0.190917
class: -1
     precision: 0.330836
     recall: 0.012441
     F-Measure: 0.023939
class: 0
     precision: 0.852997
     recall: 0.973406
     F-Measure: 0.909191

**Scikit's MultinomialNB (with fit_prior=False)**
accuracy: 0.834680
class: 1
     precision: 0.400380
     recall: 0.125361
     F-Measure: 0.190934
class: -1
     precision: 0.330836
     recall: 0.012441
     F-Measure: 0.023939
class: 0
     precision: 0.852998
     recall: 0.973418
     F-Measure: 0.909197

I have noticed that while Scikit's classifier has better overall accuracy and precision, its recall is very low compared to the NLTK one, at least with my data. Taking into account that they might be (almost) the same classifiers, isn't this strange?

2
What are the features? Did you try a BernoulliNB as well? That should be closer to the NLTK Naive Bayes. – Fred Foo
Thanks for the reply. The features are words with value 1 if they exist in the document (boolean). The results for scikit-learn's BernoulliNB are very close to MultinomialNB: accuracy: 0.834680; class 1: precision 0.400380, recall 0.125361, F-Measure 0.190934; class -1: precision 0.330836, recall 0.012441, F-Measure 0.023939; class 0: precision 0.852998, recall 0.973418, F-Measure 0.909197. – D T
The only thing I can see in the documentation is that NLTK's NB classifier apparently doesn't do smoothing. I wouldn't expect that to cause a big difference, though... – Fred Foo

2 Answers

3
votes

Is the default behavior for class weights the same in both libraries? The difference in precision for the rare class (-1) suggests that this might be the cause...
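One way to check the scikit-learn side of this is to look at the class priors the model actually uses. The following is a minimal sketch on hypothetical toy data (the matrix `X` and label counts are made up to roughly mirror the 17% / 4% / 78% split in the question); it only illustrates what `fit_prior` changes and says nothing about NLTK's behavior.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 200 documents, 20 boolean word features,
# with a skewed label distribution (78% neutral, 17% positive, 5% negative).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))
y = np.array([0] * 156 + [1] * 34 + [-1] * 10)

# fit_prior=True estimates the class priors from the label frequencies;
# fit_prior=False uses a uniform prior over the classes instead.
with_prior = MultinomialNB(fit_prior=True).fit(X, y)
uniform_prior = MultinomialNB(fit_prior=False).fit(X, y)

# classes_ is sorted, so the columns correspond to [-1, 0, 1].
learned_priors = np.exp(with_prior.class_log_prior_)
uniform_priors = np.exp(uniform_prior.class_log_prior_)

print(with_prior.classes_)
print(learned_priors)
print(uniform_priors)
```

In the question, `fit_prior=True` and `fit_prior=False` give almost identical scores, which suggests (though it does not prove) that with this many features the likelihood terms outweigh the prior, so the prior alone may not explain the gap.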

3
votes

A naive Bayes classifier usually means a Bayesian classifier over binary features that are assumed to be independent. This is what NLTK's Naive Bayes classifier implements. The corresponding scikit-learn classifier is BernoulliNB.

The restriction to boolean-valued features is not actually necessary; it is just the simplest to implement. A naive Bayes classifier can be defined for (assumed) independent features from any parametric distribution.
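As a sketch of the general form (the notation here is mine, not taken from either library): for a class $c$ and features $x_1, \dots, x_n$, the classifier picks

$$
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
$$

and the variants differ only in the per-feature distribution $P(x_i \mid c)$. In the boolean case, with $p_{ic}$ the probability that feature $i$ is present in class $c$, this is

$$
P(x_i \mid c) = p_{ic}^{\,x_i} \, (1 - p_{ic})^{\,1 - x_i}, \qquad x_i \in \{0, 1\}.
$$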

MultinomialNB is for data with integer-valued input features that are assumed to be multinomially distributed.

Scikit-learn also has GaussianNB, which is for continuous-valued features that are assumed to be independently Gaussian distributed.
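To make the difference between the three variants concrete, here is a small sketch that fits each of them to the same hypothetical boolean matrix and inspects what each one estimates. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data: 4 documents, 4 boolean word features, 2 classes.
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])
y = np.array([1, 1, 0, 0])

bnb = BernoulliNB(alpha=1.0).fit(X, y)
mnb = MultinomialNB(alpha=1.0).fit(X, y)
gnb = GaussianNB().fit(X.astype(float), y)

# BernoulliNB: one smoothed P(feature present | class) per feature;
# absent features contribute (1 - p) to the likelihood.
bernoulli_probs = np.exp(bnb.feature_log_prob_)

# MultinomialNB: a single smoothed distribution over features per class,
# so each row sums to 1 and absent features contribute nothing.
multinomial_probs = np.exp(mnb.feature_log_prob_)

# GaussianNB: a per-class mean (and variance) for every feature.
gaussian_means = gnb.theta_

print(bernoulli_probs)
print(multinomial_probs)
print(gaussian_means)
```

On boolean input the Bernoulli and multinomial models can still rank classes similarly, which would be consistent with the near-identical BernoulliNB and MultinomialNB scores reported in the comments.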