I've been using the Ruby Classifier library to classify privacy policies. I've come to the conclusion that the simple bag-of-words approach built into this library is not enough. To increase my classification accuracy, I want to train the classifier on n-grams in addition to individual words.
Is there a library for preprocessing documents that extracts the relevant n-grams and handles punctuation properly? One thought was to preprocess the documents myself and feed pseudo-n-grams into the Ruby Classifier as single tokens, like:
wordone_wordtwo_wordthree
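To make the idea concrete, here's a rough sketch of the preprocessing I have in mind. `ngram_tokens` is a hypothetical helper of my own, not part of the Classifier library, and the choices in it (lowercasing, treating punctuation as a boundary that n-grams may not cross, defaulting to trigrams) are assumptions rather than anything I've validated:

```ruby
# Hypothetical helper, not part of the Classifier library.
# Returns unigrams plus underscore-joined n-grams up to max_n words long.
def ngram_tokens(text, max_n = 3)
  # Assumption: punctuation marks a boundary, so split on it first
  # and never build an n-gram that spans two segments.
  text.downcase.split(/[.!?;:,()]+/).flat_map do |segment|
    # Keep runs of letters, digits, and apostrophes as words.
    words = segment.scan(/[a-z0-9']+/)
    (1..max_n).flat_map do |n|
      # each_cons yields nothing when the segment has fewer than n words.
      words.each_cons(n).map { |gram| gram.join("_") }
    end
  end
end

# Join the tokens back into one string to hand to the classifier.
puts ngram_tokens("We do not share your data. Ever!", 2).join(" ")
```

The resulting string would then be passed to the classifier's training method in place of the raw document. I haven't checked how Classifier's own tokenizing and stemming treat underscore-joined tokens, so this might not survive intact.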
Alternatively, maybe there's a better way to do this, such as a library with n-gram-based Naive Bayes classification built in from the get-go. I'm open to using languages other than Ruby if they get the job done (Python seems like a good candidate if need be).