
I keep on reading that Naive Bayes needs fewer features than many other ML algorithms. But what's the minimum number of features you actually need to get good results (90% accuracy) with a Naive Bayes model? I know there is no objective answer to this -- it depends on your exact features and what in particular you are trying to learn -- but I'm looking for a numerical ballpark answer to this.

I'm asking because I have a dataset with around 280 features and want to understand if this is way too few features to use with Naive Bayes. (I tried running Naive Bayes on my dataset and although I got 86% accuracy, I cannot trust this number as my data is imbalanced and I believe this may be responsible for the high accuracy. I am currently trying to fix this problem.)

In case it's relevant: the exact problem I'm working on is generating time tags for Wikipedia articles. Many times the infobox of a Wikipedia article contains a date. However, many times this date appears in the text of the article but is missing from the infobox. I want to use Naive Bayes to identify which date from all the dates we find in the article's text we should place in the infobox. Every time I find a sentence with a date in it I turn it into a feature vector -- listing what number paragraph I found this in, how many times this particular date appears in the article, etc. I've limited myself to a small subset of Wikipedia articles -- just apple articles -- and as a result, I only have 280 or so features. Any idea if this is enough data?
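For concreteness, a simplified sketch of the feature extraction step (the real vector has more fields; these names are illustrative only):

```python
# Simplified sketch of the per-date feature extraction described above.
# The actual implementation has many more fields; names are illustrative.
def date_features(paragraph_number, date_count_in_article):
    """Build one feature vector per (sentence, date) occurrence."""
    return [paragraph_number, date_count_in_article]

vec = date_features(paragraph_number=2, date_count_in_article=7)
print(vec)  # [2, 7]
```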

Thanks!

For the record this kind of question would fit better on datascience.stackexchange.com – Erwan

1 Answer


I know there is no objective answer to this -- it depends on your exact features and what in particular you are trying to learn -- but I'm looking for a numerical ballpark answer to this.

Well, you kind of answered this question yourself but you're still hoping there is an objective answer ;)

There can't be any kind of objective answer (whether precise or not) because it depends on the data, i.e. the relationships between features and class. It's easy to find examples of simple problems where a couple of features are enough to achieve perfect performance, and it's just as easy to create a dataset of millions of random features which can't even reach mediocre performance.
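A quick sketch of that point, assuming scikit-learn's GaussianNB as the NB implementation: one informative feature beats 100 purely random ones.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# 100 purely random features: no relationship to y, so NB stays near chance.
X_random = rng.normal(size=(n, 100))
# A single informative feature: y plus a little noise separates the classes.
X_signal = (y + rng.normal(scale=0.2, size=n)).reshape(-1, 1)

random_score = cross_val_score(GaussianNB(), X_random, y, cv=5).mean()
signal_score = cross_val_score(GaussianNB(), X_signal, y, cv=5).mean()
print(f"100 random features: {random_score:.2f}")    # near 0.5 (chance)
print(f"1 informative feature: {signal_score:.2f}")  # near 1.0
```

So the number of features by itself tells you essentially nothing about the performance you can expect.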

good results (90% accuracy)

Similar point about performance: there are tasks where 90% accuracy is mediocre and tasks where 60% is excellent; it depends on how hard the problem is (i.e. how easy it is to find the patterns in the data which help predict the answer).

I'm asking because I have a dataset with around 280 features and want to understand if this is way too few features to use with Naive Bayes.

Definitely not too few, as per my previous observations. But it also depends on how many instances there are, in particular the ratio of features to instances. If there are too few instances, the model is going to overfit badly with NB.
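You can see this overfitting effect directly, again assuming scikit-learn's GaussianNB: with 280 features and completely random labels, there is nothing real to learn, so any gap between training accuracy and cross-validated accuracy is pure overfitting. With only 30 instances the model memorizes the training set; with 3000 the gap largely disappears.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_features = 280

def fit_scores(n_instances):
    # Random features and random labels: any train/CV gap is overfitting.
    X = rng.normal(size=(n_instances, n_features))
    y = rng.integers(0, 2, size=n_instances)
    model = GaussianNB().fit(X, y)
    return model.score(X, y), cross_val_score(model, X, y, cv=5).mean()

train_small, cv_small = fit_scores(30)     # 280 features, 30 instances
train_large, cv_large = fit_scores(3000)   # 280 features, 3000 instances
print(f"n=30:   train {train_small:.2f}, cv {cv_small:.2f}")
print(f"n=3000: train {train_large:.2f}, cv {cv_large:.2f}")
```

With n=30 the training accuracy is near 1.0 while cross-validation stays around chance, which is exactly the overfitting pattern to watch for in your own data.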

my data is imbalanced and I believe this may be responsible for the high accuracy

Good observation: accuracy is not an appropriate evaluation measure for imbalanced data. The reason is simple: if the majority class represents, say, 86% of the instances, the classifier can just label all the instances with this class and obtain 86% accuracy, even though it does nothing useful. You should use precision, recall and F-score instead (based on the minority class).
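Here is that degenerate baseline made concrete (a sketch with made-up numbers matching your 86%, using scikit-learn's metrics): a classifier that always predicts the majority class scores 86% accuracy while being useless on the minority class.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0] * 86 + [1] * 14   # imbalanced ground truth: 86% majority class
y_pred = [0] * 100             # classifier that always predicts the majority

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary", zero_division=0)

print(f"accuracy:  {acc:.2f}")   # 0.86 -- looks good
print(f"precision: {prec:.2f}")  # 0.00
print(f"recall:    {rec:.2f}")   # 0.00 -- never finds the minority class
print(f"f-score:   {f1:.2f}")    # 0.00
```

If your NB model's minority-class recall and F-score are well above zero, then the 86% accuracy reflects something real; if they are near zero, the model is essentially this baseline.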