I am trying to extract collocations from a corpus using nltk and then use their occurrences as features for a scikit-learn classifier. Unfortunately I am not very familiar with nltk and I don't see an easy way to do this. I got this far:
- extract collocations from the corpus using `BigramCollocationFinder`
- for each document, extract all bigrams (using `nltk.bigrams`) and check whether they are among the collocations
- create a `TfidfVectorizer` with an analyzer that does nothing
- feed it the documents in the form of the extracted bigrams
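Here is a minimal sketch of the steps above (the toy corpus and the choice of PMI as the scoring measure are my own placeholders, not part of any fixed recipe):

```python
from nltk import bigrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus: each document is a list of tokens
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps while the quick fox runs".split(),
    "a quick brown fox and a lazy dog".split(),
]

# 1. find the top collocations over the whole corpus
all_tokens = [tok for doc in docs for tok in doc]
finder = BigramCollocationFinder.from_words(all_tokens)
collocations = set(finder.nbest(BigramAssocMeasures.pmi, 10))

# 2. per document, keep only the bigrams that are known collocations
def collocation_analyzer(doc_tokens):
    return [bg for bg in bigrams(doc_tokens) if bg in collocations]

# 3. a TfidfVectorizer whose "analyzer" is the function above,
#    so it skips its own tokenization entirely
vectorizer = TfidfVectorizer(analyzer=collocation_analyzer)
X = vectorizer.fit_transform(docs)  # one row per document
```

The feature names end up being the bigram tuples themselves, which is workable but underlines how roundabout the whole pipeline feels.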
That seems pretty overcomplicated to me. It also has the problem that `BigramCollocationFinder` has a `window_size` parameter for bigrams that span intervening words, which the standard `nltk.bigrams` extraction cannot reproduce.
A way to overcome this would be to instantiate a new `BigramCollocationFinder` for each document, extract its bigrams again, and match them against the ones I found before... but again, that seems way too complicated. Surely there is an easier way to do this that I'm overlooking.
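The best per-document workaround I have come up with so far is a small helper of my own (not an nltk API) that generates windowed bigrams directly, which I *assume* mirrors the pairs `BigramCollocationFinder` counts for a given `window_size`:

```python
from nltk import bigrams

def windowed_bigrams(tokens, window_size=2):
    """Yield (tokens[i], tokens[j]) for every j with i < j < i + window_size.

    With window_size=2 this reduces to plain nltk.bigrams; larger windows
    also pair words that are up to window_size - 1 positions apart.
    """
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window_size, len(tokens))):
            yield (tokens[i], tokens[j])

doc = "the quick brown fox".split()

# default window is equivalent to plain bigrams
assert list(windowed_bigrams(doc)) == list(bigrams(doc))

# window_size=3 additionally pairs words one token apart
pairs = list(windowed_bigrams(doc, window_size=3))
# -> [('the', 'quick'), ('the', 'brown'), ('quick', 'brown'),
#     ('quick', 'fox'), ('brown', 'fox')]
```

These windowed bigrams could then be intersected with the collocation set per document, avoiding a fresh `BigramCollocationFinder` each time, but it still feels like reinventing machinery that should already exist somewhere.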
Thanks for your suggestions!