Training Naive Bayes Classifier on ngrams

Question

I've been using the Ruby Classifier library to classify privacy policies. I've come to the conclusion that the simple bag-of-words approach built into this library is not enough. To increase my classification accuracy, I want to train the classifier on n-grams in addition to individual words.

I was wondering whether there's a library out there for preprocessing documents to get relevant n-grams (and properly deal with punctuation). One thought was that I could preprocess the documents and feed pseudo-ngrams into the Ruby Classifier like:

wordone_wordtwo_wordthree

Or maybe there's a better way to do this, such as a library that has n-gram-based Naive Bayes classification built in from the get-go. I'm open to using languages other than Ruby if they get the job done (Python seems like a good candidate if need be).
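As a rough illustration of the pseudo-n-gram idea, here is a minimal Python sketch using only the standard library. The function name `pseudo_ngrams`, the `n_max` parameter, and the tokenizing regex are assumptions made for this example, not part of any classifier library. It lowercases the text, drops punctuation, and emits single words plus underscore-joined n-grams that a bag-of-words classifier could treat as ordinary tokens.

```python
import re


def pseudo_ngrams(text, n_max=3):
    """Return single words plus underscore-joined n-grams up to n_max words."""
    # Lowercase, then keep only runs of letters, digits, and apostrophes.
    # This discards punctuation such as "." and ",".
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = []
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            features.append("_".join(tokens[i:i + n]))
    return features


print(pseudo_ngrams("We never sell your data.", n_max=2))
```

Note that this sketch ignores sentence boundaries, so an n-gram can span the end of one sentence and the start of the next. Splitting into sentences first would avoid that.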

Nolen Royalty · Accepted Answer · 2012-04-09T20:21:11

If you're OK with Python, I'd say NLTK would be perfect for you.

For example:

>>> import nltk
>>> s = "This is some sample data.  Nltk will use the words in this string to make ngrams.  I hope that this is useful.".split()
>>> model = nltk.NgramModel(2, s)
>>> model._ngrams
set([('to', 'make'), ('sample', 'data.'), ('the', 'words'), ('will', 'use'), ('some', 'sample'),
('', 'This'), ('use', 'the'), ('make', 'ngrams.'), ('ngrams.', 'I'), ('hope', 'that'),
('is', 'some'), ('is', 'useful.'), ('I', 'hope'), ('this', 'string'), ('Nltk', 'will'),
('words', 'in'), ('this', 'is'), ('data.', 'Nltk'), ('that', 'this'), ('string', 'to'),
('in', 'this'), ('This', 'is')])

NLTK even ships a Naive Bayes classifier, the class nltk.NaiveBayesClassifier.
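To show how the two pieces could fit together, here is a minimal sketch that trains `nltk.NaiveBayesClassifier` on unigram and bigram features. It builds n-grams with `nltk.util.ngrams` rather than `nltk.NgramModel`, since `NgramModel` is not available in recent NLTK releases. The helper `ngram_features`, the labels, and the four training sentences are made-up toy data for illustration only. A real run would need a properly labeled corpus and better tokenization than `str.split()`.

```python
import nltk
from nltk.util import ngrams


def ngram_features(text, n_max=2):
    """Map a document to a dict of {n-gram string: True} for n = 1..n_max."""
    # Naive whitespace tokenization; an assumption made to keep the sketch short.
    tokens = text.lower().split()
    features = {}
    for n in range(1, n_max + 1):
        for gram in ngrams(tokens, n):
            features[" ".join(gram)] = True
    return features


# Toy, hand-written examples (hypothetical labels, not real policy data).
train_docs = [
    ("we never share your data", "private"),
    ("we do not sell your data", "private"),
    ("we share your data with partners", "shared"),
    ("we sell your data to advertisers", "shared"),
]

# NaiveBayesClassifier.train expects a list of (feature dict, label) pairs.
train_set = [(ngram_features(text), label) for text, label in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

label = classifier.classify(ngram_features("we never sell your data"))
print(label)
```

Because the features are plain dictionary keys, adding trigrams only requires raising `n_max`. With so little training data the predicted label is not meaningful; the point is the shape of the feature dictionaries.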
