3
votes

I am working on a text categorization (machine learning) problem using Naive Bayes, with each word as a feature. I have implemented it and I am getting good accuracy.

Is it possible for me to use tuples of words as features?

For example, suppose there are two classes, Politics and Sports. The word government might appear in both of them. However, in Politics I can have the tuple (government, democracy), whereas in Sports I can have the tuple (government, sportsman). So, if a new article about politics comes in, the tuple (government, democracy) should get a higher probability than the tuple (government, sportsman).
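To make this concrete, here is a rough sketch of what I have in mind for extracting such word-pair features alongside the single words (the function name and the co-occurrence-within-document definition of a pair are just my assumptions):

```python
from itertools import combinations

def extract_features(text):
    """Single words plus unordered co-occurring word pairs as features (rough sketch)."""
    words = set(text.lower().split())
    features = set(words)                                   # unigram features
    features |= {tuple(sorted(pair))                        # word-pair (tuple) features
                 for pair in combinations(words, 2)}
    return features

# extract_features("government passed a democracy bill")
# -> contains 'government', 'democracy', and ('democracy', 'government'), among others
```

For longer documents I would probably restrict the pairs to adjacent words (bigrams) to keep the feature set manageable.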

My concern is whether doing this violates the independence assumption of Naive Bayes, since I am still using the single words as features too.

Also, I am thinking of adding weights to features. For example, a 3-tuple feature will have less weight than a 4-tuple feature.
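And here is a rough sketch of the weighting idea, assuming the classifier scores a class by summing log-probabilities of the features present in the document (the weighting function itself is just an illustration):

```python
import math

def feature_weight(feature):
    # a single word gets weight 1, a k-tuple gets weight k (illustrative choice)
    return 1.0 if isinstance(feature, str) else float(len(feature))

def class_score(features, log_prior, log_prob):
    """Weighted Naive-Bayes-style log score for one class (sketch only).
    log_prob maps a feature to log P(feature | class); unseen features get a small floor."""
    score = log_prior
    for f in features:
        score += feature_weight(f) * log_prob.get(f, math.log(1e-6))
    return score
```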

Theoretically, don't these two approaches change the independence assumptions of the Naive Bayes classifier? Also, I have not tried this approach yet, but would it improve accuracy? My guess is that the accuracy might not improve, but the amount of training data required to reach the same accuracy would be smaller.


2 Answers

5
votes

Even without adding bigrams, real documents already violate the independence assumption. Conditioned on a document containing Obama, the word President is much more likely to appear. Nonetheless, Naive Bayes still does a decent job at classification, even if the probability estimates it gives are hopelessly off. So I recommend that you go ahead and add more complex features to your classifier and see whether they improve accuracy.

If you get the same accuracy with less data, that is basically equivalent to getting better accuracy with the same amount of data.

On the other hand, using simpler, more common features works better as you decrease the amount of data. If you try to fit too many parameters to too little data, you tend to overfit badly.

But the bottom line is to try it and see.
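If you happen to be using something like scikit-learn, the unigram vs. unigram-plus-bigram comparison is a one-line change, so it is cheap to run (the corpus below is a placeholder; this is a sketch of the experiment, not your actual setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder corpus; substitute your own documents and labels
texts = ["the government passed a new democracy bill",
         "parliament debated the democracy reforms",
         "the sportsman thanked the government minister",
         "the sportsman won the final match"]
labels = ["politics", "politics", "sports", "sports"]

for ngrams in [(1, 1), (1, 2)]:          # unigrams only vs. unigrams + bigrams
    model = make_pipeline(CountVectorizer(ngram_range=ngrams), MultinomialNB())
    acc = cross_val_score(model, texts, labels, cv=2).mean()
    print(ngrams, acc)
```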

2
votes

No, from a theoretical viewpoint, you are not changing the independence assumption. You are simply creating a modified (or new) sample space. In general, once you start using higher n-grams as events in your sample space, data sparsity becomes a problem. I think using tuples will lead to the same issue. You will probably need more training data, not less. You will probably also have to give a little more thought to the type of smoothing you use. Simple Laplace smoothing may not be ideal.
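To illustrate the smoothing point, here is what an add-k estimate looks like in code; with a huge tuple vocabulary, the k * vocab_size term in the denominator can swamp the actual counts, which is why plain add-one (k = 1) may hurt (this function is just a sketch, not tied to any particular library):

```python
import math

def smoothed_log_prob(count_in_class, total_in_class, vocab_size, k=1.0):
    """Add-k estimate of log P(feature | class); k = 1 is Laplace smoothing.
    With millions of tuple features, k * vocab_size dominates total_in_class,
    so a smaller k (or a back-off scheme) may work better."""
    return math.log((count_in_class + k) / (total_in_class + k * vocab_size))
```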

The most important point, I think, is this: whatever classifier you are using, the features are highly dependent on the domain (and sometimes even the dataset). For example, if you are classifying the sentiment of texts based on movie reviews, using only unigrams may seem counterintuitive, but they perform better than using only adjectives. On the other hand, for Twitter datasets, a combination of unigrams and bigrams was found to be good, but higher-order n-grams were not useful. Based on such reports (ref. Pang and Lee, Opinion Mining and Sentiment Analysis), I think using longer tuples will show similar results, since, after all, tuples of words are simply points in a higher-dimensional space. The basic algorithm behaves the same way.