I am trying to extract collocations from a corpus using nltk and then use their occurrences as features for a scikit-learn classifier. Unfortunately I am not very familiar with nltk and I don't see an easy way to do this. I got this far:
- extract collocations from the corpus using `BigramCollocationFinder`
- for each document, extract all bigrams (using `nltk.bigrams`) and check whether they are among the collocations
- create a `TfidfVectorizer` with an analyzer that does nothing
- feed it the documents in the form of the extracted bigrams
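Here is a minimal sketch of the steps above (the toy corpus and the choice of PMI as the scoring measure are my own placeholders, not part of any fixed recipe):

```python
from nltk import bigrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus: each document is a list of tokens
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps while the quick fox runs".split(),
    "a quick brown fox and a lazy dog".split(),
]

# 1. find the top collocations over the whole corpus
all_tokens = [tok for doc in docs for tok in doc]
finder = BigramCollocationFinder.from_words(all_tokens)
collocations = set(finder.nbest(BigramAssocMeasures.pmi, 10))

# 2. per document, keep only the bigrams that are known collocations
def collocation_analyzer(doc_tokens):
    return [bg for bg in bigrams(doc_tokens) if bg in collocations]

# 3. a TfidfVectorizer whose "analyzer" is the function above,
#    so it skips its own tokenization entirely
vectorizer = TfidfVectorizer(analyzer=collocation_analyzer)
X = vectorizer.fit_transform(docs)  # one row per document
```

The feature names end up being the bigram tuples themselves, which is workable but underlines how roundabout the whole pipeline feels.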
That seems pretty overcomplicated to me. It also has the problem that `BigramCollocationFinder` has a `window_size` parameter for bigrams that span intervening words, which the standard `nltk.bigrams` extraction cannot reproduce.
A way to overcome this would be to instantiate a new `BigramCollocationFinder` for each document, extract its bigrams again, and match them against the ones I found before... but again, that seems way too complicated. Surely there is an easier way to do this that I'm overlooking.
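The best per-document workaround I have come up with so far is a small helper of my own (not an nltk API) that generates windowed bigrams directly, which I *assume* mirrors the pairs `BigramCollocationFinder` counts for a given `window_size`:

```python
from nltk import bigrams

def windowed_bigrams(tokens, window_size=2):
    """Yield (tokens[i], tokens[j]) for every j with i < j < i + window_size.

    With window_size=2 this reduces to plain nltk.bigrams; larger windows
    also pair words that are up to window_size - 1 positions apart.
    """
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window_size, len(tokens))):
            yield (tokens[i], tokens[j])

doc = "the quick brown fox".split()

# default window is equivalent to plain bigrams
assert list(windowed_bigrams(doc)) == list(bigrams(doc))

# window_size=3 additionally pairs words one token apart
pairs = list(windowed_bigrams(doc, window_size=3))
# -> [('the', 'quick'), ('the', 'brown'), ('quick', 'brown'),
#     ('quick', 'fox'), ('brown', 'fox')]
```

These windowed bigrams could then be intersected with the collocation set per document, avoiding a fresh `BigramCollocationFinder` each time, but it still feels like reinventing machinery that should already exist somewhere.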
Thanks for your suggestions!