
I am developing a naive Bayes classifier using a simple bag-of-words approach. My question: in naive Bayes, or in any other machine learning scenario, 'training' the classifier is an important matter. But how do I train a naive Bayes classifier when I already have a bag_of_words of various classes?

Have a look at this tutorial. – Tim Biegeleisen
@TimBiegeleisen I have read the tutorial, but a question still remains. Suppose I have two classes, positive and negative. In my training data set the positive class has a number of positive strings and the negative class has a number of negative strings. But in the positive strings not all the words are positive, and that is where the problem arises: when I take words from them and put them into the positive bag_of_words, some negative words get added as well, which hampers the later classification. – Pritam
@Pritam does the positive or negative slant of the words depend on the context? If so, you need to add the context as features in your X vector for each sample (word). Otherwise, how would the classifier be able to distinguish? – miraculixx

2 Answers

1 vote

how to train a naive Bayes classifier when I already have a bag_of_words of various classes

In general, what you do is this (see the sketch after the list):

  1. split your bag of words into two random subsets; call one training and the other test
  2. train the classifier on the training subset
  3. validate the classifier's accuracy by running it against the test subset
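
For concreteness, here is a minimal sketch of those three steps using scikit-learn. The toy texts, labels, and the choice of CountVectorizer with MultinomialNB are my own assumptions, not something given in the question; substitute your real bag-of-words data.

```python
# Minimal train/test sketch with scikit-learn (toy data for illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = ["great product", "awful service", "really great", "truly awful"]
labels = ["positive", "negative", "positive", "negative"]

# 1. split into random training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# turn raw strings into bag-of-words count vectors
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

# 2. train the classifier on the training subset
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

# 3. validate accuracy against the held-out test subset
print(clf.score(vectorizer.transform(X_test), y_test))
```

With a real corpus you would of course use far more than four strings; the split, fit, and score steps stay the same.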

'training' the classifier is an important matter

indeed -- that's how your classifier learns to separate words from different classes.

0 votes

The Stanford IR book gives a good explanation of how Naive Bayes classifiers work, and they use text classification as their example. The Wikipedia article also gives a detailed description of the theory and some concrete examples.

In a nutshell, you count the occurrences of each word type within each class, and then normalize the counts (by the total number of word occurrences in the class for the multinomial model, or by the number of documents for the Bernoulli model) to get the probability of a word given a class, p(w|c). You then use Bayes' rule to get the probability of each class given the document, p(c|doc) ∝ p(c) * p(doc|c), where the probability of the document given the class is the product of the probabilities of its words given the class, p(doc|c) = Π(w in doc) p(w|c). These products get very small before you normalize between the classes, so you usually take the logarithm of each probability and sum the logs to avoid underflow errors.
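
Here is a from-scratch sketch of that counting-and-normalizing procedure for the multinomial model. The toy documents and variable names are hypothetical, and I add simple add-one (Laplace) smoothing, which the explanation above does not mention, so that unseen words do not zero out the product.

```python
# Multinomial naive Bayes from scratch (toy data; add-one smoothing is an extra assumption).
import math
from collections import Counter

# hypothetical tokenized training documents per class
docs = {
    "positive": [["great", "product"], ["really", "great"]],
    "negative": [["awful", "service"], ["truly", "awful"]],
}

vocab = {w for ds in docs.values() for d in ds for w in d}
total_docs = sum(len(ds) for ds in docs.values())
priors = {c: len(ds) / total_docs for c, ds in docs.items()}  # p(c)

# count word occurrences per class, then normalize (with add-one smoothing) to get p(w|c)
word_probs = {}
for c, ds in docs.items():
    counts = Counter(w for d in ds for w in d)
    total = sum(counts.values())
    word_probs[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def classify(doc):
    # log p(c) + sum of log p(w|c): summing logs instead of multiplying avoids underflow
    scores = {
        c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc if w in vocab)
        for c in docs
    }
    return max(scores, key=scores.get)

print(classify(["really", "awful", "service"]))  # prints "negative"
```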