I am trying to find out how to extract the collocates of a specific word out of a text. As in: what are the words that make a statistically significant collocation with e.g. the word "hobbit" in the entire text corpus? I am expecting a result similar to a list of words (collocates ) or maybe tuples (my word + its collocate).
I know how to make bi- and tri-grams using nltk, and also how to select only the bi- or trigrams that contain my word of interest. I am using the following code (adapted from this StackOverflow question).
import nltk
from nltk.collocations import *
corpus = nltk.Text(text) # "text" is a list of tokens
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(corpus)
# Only trigrams that appear 3+ times
tri_finder.apply_freq_filter(3)
# Only the ones containing my word
my_filter = lambda *w: 'Hobbit' not in w
tri_finder.apply_ngram_filter(my_filter)
print tri_finder.nbest(trigram_measures.likelihood_ratio, 20)
This works fine and gives me a list of trigrams (one element of of which is my word) each with their log-likelihood value. But I don't really want to select words only from a list of trigrams. I would like to make all possible N-Gram combinations in a window of my choice (for example, all words in a window of 3 left and 3 right from my word - that would mean a 7-Gram), and then check which of those N-gram words has a statistically relevant frequency paired with my word of interest. I would like to take the Log-Likelihood value for that.
My idea would be:
1) Calculate all N-Gram combinations in different sizes containing my word (not necessarily using nltk, unless it allows to calculate units larger than trigrams, but i haven't found that option),
2) Compute the log-likelihood value for each of the words composing my N-grams, and somehow compare it against the frequency of the n-gram they appear in (?). Here is where I get lost a bit... I am not experienced in this and I don't know how to think this step.
Does anyone have suggestions how I should do? And assuming I use the pool of trigrams provided by nltk for now: does anyone have ideas how to proceed from there to get a list of the most relevant words near my search word?
Thank you