11 votes

I've been using the Ruby Classifier library to classify privacy policies. I've come to the conclusion that the simple bag-of-words approach built into this library is not enough. To increase my classification accuracy, I want to train the classifier on n-grams in addition to individual words.

I was wondering whether there's a library out there that preprocesses documents to extract relevant n-grams (and properly handles punctuation). One thought was that I could preprocess the documents myself and feed pseudo-ngrams into the Ruby Classifier, like:

wordone_wordtwo_wordthree

Or maybe there's a better way of doing this, such as a library with n-gram-based Naive Bayes classification built in from the get-go. I'm open to using languages other than Ruby here if they get the job done (Python seems like a good candidate if need be).
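
Roughly, I'm imagining something like this sketch (the pseudo_ngrams helper and the category names are just placeholders; this assumes the classifier gem's Classifier::Bayes train/classify API):

require 'classifier'

# Lowercase, strip punctuation, then return the unigrams plus
# underscore-joined n-grams as extra pseudo-words.
def pseudo_ngrams(text, n = 2)
  words = text.downcase.gsub(/[^a-z\s]/, ' ').split
  words + words.each_cons(n).map { |gram| gram.join('_') }
end

# Hypothetical categories for the privacy-policy task.
bayes = Classifier::Bayes.new 'Privacy', 'Other'
bayes.train 'Privacy', pseudo_ngrams('we may share your personal data with third parties').join(' ')
bayes.train 'Other',   pseudo_ngrams('the quick brown fox jumps over the lazy dog').join(' ')
bayes.classify pseudo_ngrams('they share personal data').join(' ')
# => "Privacy" (with this toy training data)

One thing I'd have to check is that the classifier's own tokenizer keeps the underscored tokens intact (underscores count as word characters, so they should survive its punctuation stripping).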

2 Answers

12 votes

If you're OK with Python, I'd say NLTK would be perfect for you.

For example:

>>> import nltk
>>> s = "This is some sample data.  Nltk will use the words in this string to make ngrams.  I hope that this is useful.".split()
>>> list(nltk.ngrams(s, 2))
[('This', 'is'), ('is', 'some'), ('some', 'sample'), ('sample', 'data.'),
 ('data.', 'Nltk'), ('Nltk', 'will'), ('will', 'use'), ('use', 'the'),
 ('the', 'words'), ('words', 'in'), ('in', 'this'), ('this', 'string'),
 ('string', 'to'), ('to', 'make'), ('make', 'ngrams.'), ('ngrams.', 'I'),
 ('I', 'hope'), ('hope', 'that'), ('that', 'this'), ('this', 'is'),
 ('is', 'useful.')]

Note that a plain str.split leaves the punctuation attached to the words ("data.", "ngrams."); nltk.word_tokenize will split that off for you.

You even have a built-in classifier: nltk.NaiveBayesClassifier.

3 votes

>> s = "She sells sea shells by the sea shore"
=> "She sells sea shells by the sea shore"
>> s.split(/ /).each_cons(2).map { |x, y| x + ' ' + y }
=> ["She sells", "sells sea", "sea shells", "shells by", "by the", "the sea", "sea shore"]

Ruby's Enumerable module has a method called each_cons, which yields every run of n consecutive items from the collection. With that method, generating n-grams is a simple one-liner.
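
If you want the underscore-joined pseudo-ngrams from the question, swap the separator and pick your n. Continuing the same irb session:

>> n = 3
=> 3
>> s.split(/ /).each_cons(n).map { |gram| gram.join('_') }
=> ["She_sells_sea", "sells_sea_shells", "sea_shells_by", "shells_by_the", "by_the_sea", "the_sea_shore"]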