14
votes

I am using the scikit-learn Multinomial Naive Bayes classifier for binary text classification (the classifier tells me whether a document belongs to category X or not). I use a balanced dataset to train my model and a balanced test set to test it, and the results are very promising.

This classifier needs to run in real time and constantly analyze documents thrown at it randomly.

However, when I run my classifier in production, the number of false positives is very high, and therefore I end up with very low precision. The reason is simple: there are many more negative samples that the classifier encounters in the real-time scenario (around 90% of the time), and this does not correspond to the ideal balanced dataset I used for training and testing.

Is there a way I can simulate this real-time case during training, or are there any tricks I can use (including pre-processing on the documents to see whether they are suitable for the classifier)?

I was planning to train my classifier using an imbalanced dataset with the same proportions as in the real-time case, but I am afraid that might bias Naive Bayes towards the negative class and lose the recall I have on the positive class.

Any advice is appreciated.

I think you know the problem and the solution: you need a training sample that reflects your real scenario. However, did you try a cross-validation technique? – gustavodidomenico
I use cross-validation to pick my model parameters (the smoothing parameter, for example). I have read that an imbalanced dataset is not good for Naive Bayes; would you still recommend it? Wouldn't it then just classify everything as negative? – Erol
I think no classification algorithm will perform well on an unbalanced data set when it is trained on a balanced sample. Unbalanced data sets are a common problem in data mining, and I would recommend looking for ways to improve your dataset. I am fairly sure you would get better results with a decision-tree-based algorithm like CART or J48. Have you tried one? – gustavodidomenico
Whether it classifies "everything as negative" will depend on your calibration. Do you know the WEKA tool? – gustavodidomenico
I'd ask on stats.stackexchange.com as well. – Dominik Antal

3 Answers

11
votes

You have encountered one of the problems with classification with a highly imbalanced class distribution. I have to disagree with those that state the problem is with the Naive Bayes method, and I'll provide an explanation which should hopefully illustrate what the problem is.

Imagine your false positive rate is 0.01, and your true positive rate is 0.9. This means your false negative rate is 0.1 and your true negative rate is 0.99.

Imagine an idealised test scenario where you have 100 test cases from each class. You'll get (in expectation) 1 false positive and 90 true positives. Great! Precision is 90 / (90+1) on your positive class!

Now imagine there are 1000 times more negative examples than positive. Same 100 positive examples at test, but now there are 1000000 negative examples. You now get the same 90 true positives, but (0.01 * 1000000) = 10000 false positives. Disaster! Your precision is now almost zero (90 / (90+10000)).
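
Replaying that arithmetic in a couple of lines of Python, in case it helps (the rates are the hypothetical ones above, not measurements from any real classifier):

    # Replaying the arithmetic above; tpr and fpr are the hypothetical rates
    # from this answer, not measurements from a real classifier.
    tpr, fpr = 0.9, 0.01

    for n_pos, n_neg in [(100, 100), (100, 1_000_000)]:
        tp = tpr * n_pos                # expected true positives
        fp = fpr * n_neg                # expected false positives
        print(f"{n_pos} pos / {n_neg} neg -> precision = {tp / (tp + fp):.4f}")
    # 100 pos / 100 neg     -> precision = 0.9890
    # 100 pos / 1000000 neg -> precision = 0.0089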

The point here is that the performance of the classifier hasn't changed; false positive and true positive rates remained constant, but the balance changed and your precision figures dived as a result.

What to do about it is harder. If your scores are separable but the threshold is wrong, you should look at the ROC curve for thresholds based on the posterior probability and look to see if there's somewhere where you get the kind of performance you want. If your scores are not separable, try a bunch of different classifiers and see if you can get one where they are (logistic regression is pretty much a drop-in replacement for Naive Bayes; you might want to experiment with some non-linear classifiers, however, like a neural net or non-linear SVM, as you can often end up with non-linear boundaries delineating the space of a very small class).
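
To make the threshold-tuning idea concrete, here is a rough sketch with scikit-learn; the Poisson "word counts" are only a stand-in for a real document-term matrix, and the FPR target is arbitrary:

    # Sketch only: choose a decision threshold from the ROC curve instead of
    # the default 0.5 cut-off. The Poisson "word counts" stand in for a real
    # document-term matrix; the FPR target is arbitrary.
    import numpy as np
    from sklearn.metrics import roc_curve
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.default_rng(0)
    X = np.vstack([rng.poisson([3, 1, 1, 1], size=(1_000, 4)),   # "positive" docs
                   rng.poisson([1, 1, 1, 3], size=(9_000, 4))])  # "negative" docs
    y = np.array([1] * 1_000 + [0] * 9_000)

    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
    clf = MultinomialNB().fit(X_train, y_train)

    scores = clf.predict_proba(X_val)[:, 1]      # posterior of the positive class
    fpr, tpr, thresholds = roc_curve(y_val, scores)

    # Example criterion: the best TPR achievable while keeping FPR <= 0.01.
    ok = fpr <= 0.01
    threshold = thresholds[ok][np.argmax(tpr[ok])]
    print("chosen threshold:", threshold)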

To simulate this effect from a balanced test set, you can simply multiply instance counts by an appropriate multiplier in the contingency table (for instance, if your negative class is 10x the size of the positive, make every negative instance in testing add 10 counts to the contingency table instead of 1).
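
With scikit-learn, one way to apply that multiplier is through sample weights on the evaluation metrics (a sketch with made-up predictions, assuming negatives are 10x as common in production):

    # Sketch: re-weight a balanced test set so the confusion matrix reflects a
    # production distribution where negatives are 10x as common as positives.
    # y_true and y_pred would come from your balanced test run; these are made up.
    import numpy as np
    from sklearn.metrics import confusion_matrix, precision_score

    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
    y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])

    # Each negative instance contributes 10 counts, each positive contributes 1.
    weights = np.where(y_true == 0, 10.0, 1.0)

    print(confusion_matrix(y_true, y_pred, sample_weight=weights))
    print("re-weighted precision:", precision_score(y_true, y_pred, sample_weight=weights))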

I hope that's of some help at least understanding the problem you're facing.

3
votes

As @Ben Allison says, the issue you're facing is basically that your classifier's accuracy isn't good enough - or, more specifically: its false positive rate is too high for the class distribution it encounters.

The "textbook" solution would indeed be to train the classifier using a balanced training set, getting a "good" classifier, then find a point on the classifier's performance curve (e.g. ROC curve) which best balances between your accuracy requirements; I assume that in your case, it would be biased towards lower false positive rate, and higher false negative rate.

However, the situation may well be that the classifier is just not good enough for your requirements - at the point where the false positives are at a reasonable level, you might be missing too many good cases.

One solution for that would be, of course, to use more data, or to try another type of classifier; e.g. linear/logistic regression or SVM, which generally have good performance in text classification.
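
A minimal sketch of that swap with scikit-learn (the toy corpus is only there so the snippet runs; plug in your own documents and labels):

    # Sketch: a drop-in text pipeline with logistic regression instead of
    # Naive Bayes. The four toy documents exist only so the snippet runs.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    docs = ["great document about topic X",
            "another document about topic X",
            "completely unrelated text here",
            "more unrelated content again"]
    labels = [1, 1, 0, 0]

    clf = make_pipeline(TfidfVectorizer(),
                        LogisticRegression(class_weight="balanced", max_iter=1000))
    clf.fit(docs, labels)
    print(clf.predict_proba(["yet another document about topic X"]))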

Having said that, it may be the case that you prefer to use Naive Bayes for some reason (e.g. constraints on training time, frequent addition of new classes, or pre-existing models). In that case, I can give some practical advice on what can be done.

  1. Assuming you already have a workflow for building Naive Bayes classifiers, you might want to consider Boosting. Generally, these methods train several weaker classifiers in a way that results in a stronger one. Boosting Naive Bayes classifiers has been shown to work nicely, e.g. see here. Best results would be achieved with a sizable and clean training set.
  2. Another practical and simple (although less "pretty") solution would be to add another layer after the existing classifier: a simple binomial Naive Bayes classifier with a threshold - in essence, a "keyword" filter, which outputs as positive only documents containing at least n words from a given dictionary (you can also allow some words to be counted more than once). Depending on your problem domain, it might be possible to construct such a dictionary manually; a rough sketch follows below. After some trial and error, I have seen this method significantly improve the false positive rate while only modestly hurting false negatives.
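
Here is a rough sketch of that second, keyword-based layer; the dictionary, the threshold n, and the whitespace tokenisation are all placeholders to be adapted to your domain:

    # Sketch of the keyword-filter layer from point 2: a document that the
    # first classifier labelled positive is kept only if it contains at least
    # min_hits words from a hand-built dictionary. KEYWORDS, MIN_HITS and the
    # whitespace tokenisation are placeholders for the real problem domain.
    KEYWORDS = {"invoice", "payment", "order", "refund"}   # hypothetical dictionary
    MIN_HITS = 2

    def keyword_filter(document: str, keywords=KEYWORDS, min_hits=MIN_HITS) -> bool:
        """True if the document contains at least min_hits keyword tokens
        (repeated occurrences of the same word are counted)."""
        tokens = document.lower().split()
        return sum(1 for t in tokens if t in keywords) >= min_hits

    def final_label(document: str, first_stage_label: int) -> int:
        """Combine the first-stage prediction with the keyword filter."""
        return 1 if first_stage_label == 1 and keyword_filter(document) else 0
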
2
votes

I think gustavodidomenico makes a good point. You can think of Naive Bayes as learning a probability distribution, in this case of words belonging to topics. So the balance of the training data matters. If you use decision trees, say a random forest model, you learn rules for making the assignment (yes, there are probability distributions involved, and I apologise for the hand-waving explanation, but sometimes intuition helps). In many cases trees are more robust than Naive Bayes, arguably for this reason.
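
If you want to try that comparison quickly, here is a hedged sketch; the toy documents are only there so it runs, and the forest settings are arbitrary:

    # Sketch: a random forest on bag-of-words features as an alternative to
    # Naive Bayes. Toy documents only; compare on your own imbalanced data.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline

    docs = ["document about topic X", "more text on topic X",
            "unrelated news article", "another off-topic document"]
    labels = [1, 1, 0, 0]

    forest = make_pipeline(CountVectorizer(),
                           RandomForestClassifier(n_estimators=200,
                                                  class_weight="balanced",
                                                  random_state=0))
    forest.fit(docs, labels)
    print(forest.predict_proba(["fresh document mentioning topic X"]))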