0
votes

I've encountered the following problem: I'm trying to classify a large number of text documents.

There are 20 classes: 1 normal and 19 abnormal. When I use Naïve Bayes classification I get the following result: classification works well for the 19 abnormal classes, but for the "normal" class I get many misclassification errors: almost all samples in the "normal" category are classified into some other (abnormal) category.

Here are my questions:

  • How should I select the training set for the "normal" class? (Currently I just fit the classifier on a set of texts labelled "normal", at a 1/20 proportion.)
  • Can the classifier be set up this way: if the probability of belonging to every class is below a certain threshold, then the classifier assigns a fallback category (e.g. "normal") to that sample?
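The thresholding idea in the second question is usually called a rejection option, and it can be implemented on top of any classifier that exposes class probabilities. A minimal sketch using scikit-learn's MultinomialNB (the corpus, labels, threshold value, and the `predict_with_reject` helper are all illustrative, not part of the original post):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical tiny training corpus with two abnormal classes.
docs = ["disk failure error", "disk error crash",
        "login denied password", "password denied access"]
labels = ["disk", "disk", "auth", "auth"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

def predict_with_reject(texts, threshold=0.6):
    """Assign 'normal' when no class probability reaches the threshold."""
    probs = clf.predict_proba(vec.transform(texts))
    results = []
    for row in probs:
        best = np.argmax(row)
        results.append(clf.classes_[best] if row[best] >= threshold else "normal")
    return results

# A document full of unseen words gets roughly the class priors as its
# probabilities, falls below the threshold, and is rejected as "normal".
print(predict_with_reject(["disk error", "weather is nice today"]))
```

The threshold itself should be tuned on held-out data, since it trades off missed abnormal cases against false "normal" assignments.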

2 Answers

1
votes

I'm not sure I have the full picture, but it seems like you in fact have only 2 classes, "normal" and "abnormal", which are unbalanced in volume and therefore in prior.

To answer your first question: in that situation I would try over-sampling your "normal" class for training (passing the same "normal" instances multiple times to "fake" a larger volume) and see if it improves your performance.
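The over-sampling suggested above can be done by simply duplicating minority-class samples before fitting; a minimal sketch in plain Python (the data, the `oversample` helper, and the duplication factor are illustrative):

```python
from collections import Counter

# Hypothetical training data where "normal" is the minority class.
docs = ["system ok", "disk fail", "disk err", "disk bad",
        "auth err", "auth fail", "auth bad"]
labels = ["normal", "disk", "disk", "disk", "auth", "auth", "auth"]

def oversample(docs, labels, target_class, factor):
    """Repeat each sample of `target_class` (factor - 1) extra times."""
    new_docs, new_labels = list(docs), list(labels)
    for d, l in zip(docs, labels):
        if l == target_class:
            new_docs.extend([d] * (factor - 1))
            new_labels.extend([l] * (factor - 1))
    return new_docs, new_labels

balanced_docs, balanced_labels = oversample(docs, labels, "normal", 3)
print(Counter(balanced_labels))  # class counts after over-sampling
```

Note that with Naïve Bayes this mainly shifts the estimated class prior; an equivalent and cheaper alternative in scikit-learn would be to pass explicit `class_prior` values to `MultinomialNB`.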

I don't get your second question.

2
votes

Most probably the unbalanced number of instances per class is causing the problem. You need to define some kind of prior over the final class estimate to counteract the imbalance, and you should fine-tune that prior's hyperparameter by cross-validation. I believe a Dirichlet prior (the smoothing parameter) is what is used for multinomial NB.
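In scikit-learn the Dirichlet prior of multinomial NB corresponds to the `alpha` smoothing parameter, which can be cross-validated as described above. A minimal sketch (the corpus, labels, and candidate `alpha` grid are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical small corpus with two classes, four samples each.
docs = ["disk failure detected", "disk error crash",
        "bad disk sector", "disk io error",
        "login denied twice", "wrong password entered",
        "password denied access", "auth token expired"]
labels = ["disk"] * 4 + ["auth"] * 4

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Tune the Dirichlet/smoothing parameter alpha by cross-validation.
grid = GridSearchCV(pipe,
                    {"multinomialnb__alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
```

`MultinomialNB` also accepts a `class_prior` argument, which lets you set the per-class prior directly instead of estimating it from the (imbalanced) training counts.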