I am developing a naive Bayes classifier using the simple bag-of-words concept. My question: in naive Bayes, or in any other machine learning scenario, 'training' the classifier is an important matter. But how do I train a naive Bayes classifier when I already have a bag_of_words for each of several classes?
2 Answers
> how to train naive bayes classifier when I already have a bag_of_words of various classes.
In general, what you do is this:
- split your bag of words into two random subsets; call one `training` and the other `test`
- train the classifier on the `training` subset
- validate the classifier's accuracy by running it against the `test` subset
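The split step above can be sketched in a few lines; this is a minimal illustration (the function name `train_test_split` and the 80/20 ratio are my choices here, not anything from the question):

```python
import random

def train_test_split(samples, test_frac=0.2, seed=0):
    """Shuffle labeled samples and split them into training and test subsets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]
```

In practice you would pass in a list of `(document, label)` pairs; libraries such as scikit-learn provide a ready-made `train_test_split` with stratification options if you want to preserve class proportions.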
> 'training' the classifier is an important matter

Indeed; that's how your classifier learns to separate words from different classes.
The Stanford IR book gives a good explanation of how Naive Bayes classifiers work, and they use text classification as their example. The Wikipedia article also gives a detailed description of the theory and some concrete examples.
In a nutshell: you count the occurrences of each word type within each class, then normalize by the total number of word tokens in that class to get the probability of a word given the class, p(w|c). You then use Bayes' rule to get the probability of each class given the document, p(c|doc) ∝ p(c)·p(doc|c), where the probability of the document given the class is the product of the probabilities of its words given the class, p(doc|c) = Π(w in doc) p(w|c). These products get very small before normalizing across the classes, so you may want to take the logarithm of each factor and sum them instead, to avoid underflow errors.
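The counting, smoothing, and log-sum steps described above can be sketched as follows. This is an illustrative implementation, not code from either referenced source; the add-one (Laplace) smoothing is a common extra step I've assumed so that unseen words don't zero out the product:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns log-priors and smoothed
    per-class log-likelihoods log p(w|c)."""
    class_docs = Counter()                 # number of documents per class
    word_counts = defaultdict(Counter)     # word token counts per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # add-one (Laplace) smoothing: (count + 1) / (total + |V|)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like, vocab

def predict(tokens, log_prior, log_like, vocab):
    """Pick argmax_c [ log p(c) + sum over words of log p(w|c) ]."""
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```

Summing logarithms instead of multiplying raw probabilities is exactly the underflow fix mentioned above: the argmax over classes is unchanged, but the arithmetic stays in a numerically safe range.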