
I am trying to extract collocations from a corpus using NLTK and then use their occurrences as features for a scikit-learn classifier. Unfortunately I am not very familiar with NLTK and I don't see an easy way to do this. I got this far:

  • extract collocations using BigramCollocationFinder from corpus
  • for each document, extract all bigrams (using nltk.bigrams) and check if they are one of the collocations
  • create a TfidfVectorizer with an analyzer that does nothing
  • feed it the documents in the form of the extracted bigrams
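
The steps above can be sketched roughly like this (a minimal sketch with a toy corpus; whitespace tokenization stands in for a real tokenizer):

```python
# A minimal sketch of the four steps above; the corpus is toy data.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.util import bigrams
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps while the quick fox runs",
]
tokenized = [doc.split() for doc in documents]

# 1. Extract collocations over the whole corpus.
finder = BigramCollocationFinder.from_documents(tokenized)
top_bigrams = set(finder.nbest(BigramAssocMeasures.pmi, 10))

# 2. For each document, keep only the bigrams that are known collocations.
features = [
    [bg for bg in bigrams(tokens) if bg in top_bigrams] for tokens in tokenized
]

# 3./4. An analyzer that "does nothing" passes the pre-extracted bigrams
# straight through, so each bigram tuple becomes one vocabulary term.
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)
X = vectorizer.fit_transform(features)
```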

That seems pretty overcomplicated to me. It also has the problem that BigramCollocationFinder has a window_size parameter for bigrams that span intervening words, which the standard nltk.bigrams extraction cannot handle.

A way to overcome this would be to instantiate a new BigramCollocationFinder for each document, extract the bigrams again and match them against the ones I found before... but again, that seems way too complicated. Surely there is an easier way that I'm overlooking.
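
That per-document-finder idea can at least be kept small: the finder is only used to enumerate windowed bigrams, which are then intersected with the collocations found on the whole corpus. A sketch, assuming BigramCollocationFinder.from_words with its window_size parameter:

```python
# Sketch: a throwaway BigramCollocationFinder per document, used purely
# to enumerate bigrams whose words fall within a window.
from nltk.collocations import BigramCollocationFinder

def windowed_bigrams(tokens, window_size=3):
    """All (w1, w2) pairs where w2 follows w1 within window_size tokens."""
    finder = BigramCollocationFinder.from_words(tokens, window_size=window_size)
    return set(finder.ngram_fd)  # the finder's bigram frequency distribution

tokens = "the quick brown fox".split()
pairs = windowed_bigrams(tokens)
# ("the", "brown") spans one intervening word, so it is included here,
# while plain nltk.bigrams would never produce it.
```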

Thanks for your suggestions!


1 Answer


larsmans has already contributed an NLTK / scikit-learn feature mapper for simple, non-collocation features. That might give you some inspiration for your own problem:

http://nltk.org/_modules/nltk/classify/scikitlearn.html
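
The linked module wraps any scikit-learn estimator behind NLTK's classifier interface, so feature dicts (for example, flags for which collocations a document contains) can be fed to it directly. A small sketch; the feature names and labels below are made up for illustration:

```python
# Sketch of the linked wrapper: SklearnClassifier maps NLTK-style feature
# dicts onto a scikit-learn estimator. Feature names here are invented.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

train = [
    ({"has_quick_fox": True, "has_lazy_dog": False}, "animals"),
    ({"has_quick_fox": False, "has_lazy_dog": True}, "pets"),
]
classifier = SklearnClassifier(BernoulliNB()).train(train)
label = classifier.classify({"has_quick_fox": True, "has_lazy_dog": False})
```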