4
votes

I'm working on a Naive Bayes text classifier and have already implemented a bag-of-words approach in code. In my documents I have noticed many features that are unique to certain classes. Examples include whether or not the document contains a location, a date, or a name. These are all Boolean values and can be determined before the text is classified. There are also other features, such as what the first word is.

I understand the basic Naive Bayes approach, but I have failed to find information on incorporating these features into a classifier.

My question: is it possible to combine the features I mentioned above with bag of words? If so, is there an example I could follow? If not, what would you recommend?

Thank You

1 Answer

3
votes

Within the Naive Bayes framework, nothing prevents you from adding features that aren't based on a bag-of-words representation. Let's say you have a class likelihood p(document|class_1) = l_1 based on your bag-of-words features. You have reason to believe that some binary features b_1 and b_2 will also help in the classification (to make the example concrete, b_1 could be whether the document contains a date and b_2 whether it contains a time).

You estimate the probability p(b_1 = 1 | class_1) = (# of docs in class 1 with b_1 = 1) / (# of docs in class 1), and p(b_1 = 0 | class_1) = 1 - p(b_1 = 1 | class_1). You do the same for class 2, and for the feature b_2 in both classes. Adding these features to the classification rule is then particularly simple, because Naive Bayes assumes the features are conditionally independent given the class. So:

p(class_1 | document) ∝ p(class_1) × l_1 × p(b_1 | class_1) × p(b_2 | class_1)

where l_1 means the same as before (the likelihood based on the bag-of-words features), and for each p(b_i|class_1) term you use either p(b_i=1|class_1) or p(b_i=0|class_1), depending on the value b_i actually takes in the document being classified. This extends to non-binary features in the same way: estimate p(feature = value | class) for each value and multiply it in. You can keep adding features to your heart's content, but be aware that you are assuming the features are independent given the class. If that assumption is badly violated (for example, by strongly correlated features), you may wish to switch to a classifier that doesn't make it.
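To make the rule above concrete, here is a minimal sketch in Python. It is not taken from any library: the names `train`, `log_posterior`, and `classify`, and the toy documents, are hypothetical. It also makes two assumptions that go beyond the formula above: it works in log space to avoid numerical underflow, and it applies add-one (Laplace) smoothing to both the word and the binary-feature estimates so that no probability is exactly zero.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, flags, label); flags is a tuple of 0/1 values."""
    class_counts = Counter()            # number of docs per class
    word_counts = defaultdict(Counter)  # word_counts[c][w] = count of w in class c
    flag_counts = defaultdict(Counter)  # flag_counts[c][i] = docs in c with b_i = 1
    vocab = set()
    for tokens, flags, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
        for i, b in enumerate(flags):
            flag_counts[label][i] += int(b)
    return class_counts, word_counts, flag_counts, vocab

def log_posterior(tokens, flags, label, model):
    """Unnormalised log p(class | document): log prior + BOW term + flag terms."""
    class_counts, word_counts, flag_counts, vocab = model
    n_c = class_counts[label]
    score = math.log(n_c / sum(class_counts.values()))  # log p(class)
    total_words = sum(word_counts[label].values())
    for w in tokens:  # log l_1: multinomial BOW likelihood (assumed: add-one smoothing)
        if w in vocab:  # words never seen in training are skipped
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
    for i, b in enumerate(flags):  # log p(b_i | class), smoothed
        p_one = (flag_counts[label][i] + 1) / (n_c + 2)
        score += math.log(p_one if b else 1.0 - p_one)
    return score

def classify(tokens, flags, model):
    return max(model[0], key=lambda c: log_posterior(tokens, flags, c, model))

# Toy data: flags = (contains_date, contains_name)
docs = [
    (["meet", "at", "noon"], (1, 0), "event"),
    (["lunch", "on", "friday"], (1, 0), "event"),
    (["buy", "cheap", "pills"], (0, 0), "spam"),
    (["cheap", "offer", "now"], (0, 1), "spam"),
]
model = train(docs)
print(classify(["cheap"], (0, 0), model))
print(classify(["unseen"], (1, 0), model))
```

In the second call the only token is unknown, so the decision rests entirely on the prior and the binary features, which shows that the extra features contribute to the score independently of the bag-of-words term.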