
Given the sentence "the quick brown fox jumped over the lazy dog", I would like to get a score of how frequent each word is, based on an NLTK corpus (whichever corpus is most generic/comprehensive).

EDIT:

This question is in relation to this earlier question: python nltk keyword extraction from sentence, where @adi92 suggested using idf to calculate the 'rareness' of a word. I would like to see what this would look like in practice. The broader problem here is: how do you calculate the rareness of a word's use in the English language? I appreciate that this is a hard problem to solve, but nonetheless NLTK's idf (with something like the Brown or Reuters corpus?) might get us part of the way there.
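
To make this concrete, here is a rough sketch of the kind of thing I have in mind, using NLTK's TextCollection over the Brown corpus (treating each Brown file as one "document" is my own assumption about the right granularity):

    # Sketch: score each word by its idf over the Brown corpus.
    # Assumes each Brown file counts as one "document"; requires the
    # Brown corpus data (nltk.download('brown')).
    from nltk.corpus import brown
    from nltk.text import TextCollection

    # Building the collection over all ~500 Brown files is slow;
    # a slice of brown.fileids() is enough for experimenting.
    collection = TextCollection([brown.words(fileid) for fileid in brown.fileids()])

    sentence = "the quick brown fox jumped over the lazy dog"
    for word in sentence.split():
        # Higher idf = the word appears in fewer documents, i.e. it is rarer.
        print(word, collection.idf(word))

Is something along these lines a sensible approach, or is there a better corpus/technique?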

Do you have a specific problem, or do you just want someone to do your job / homework for you? – millimoose
In the test corpus you supplied, there are two occurrences of "the" and one each of "quick", "brown", "fox", "jumped", "over", "lazy", and "dogs". If this corpus is not good enough, please document your requirements. – tripleee
I linked the word idf to the Wikipedia article about the same in that answer, which explains the formula used to derive a word's idf. The idf of a word is the logarithm of (# of documents in your corpus) / (# of documents in your corpus which mention that particular word at least once) ... what exactly are you confused about relating to this formula and need an explanation for? – Aditya Mukherji
just looking for an example with nltk – waigani

1 Answer


If you want to know word frequencies, you need a table of word frequencies. Words have different frequencies depending on text genre, so the best frequency table might be based on a domain-specific corpus.

If you're just messing around, it's easy enough to pick a corpus at random and count the words: use <corpus>.words() and NLTK's FreqDist, and/or see the NLTK book for details.
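
For example, a quick sketch along those lines, using the Brown corpus (an arbitrary but reasonably general choice) and FreqDist:

    # Sketch: relative frequency of each word in the Brown corpus.
    # Requires the Brown corpus data (nltk.download('brown')).
    import nltk
    from nltk.corpus import brown

    # Frequency distribution over all Brown tokens, lowercased.
    freqs = nltk.FreqDist(w.lower() for w in brown.words())

    sentence = "the quick brown fox jumped over the lazy dog"
    for word in sentence.split():
        # freq() returns count/total tokens; rarer words get smaller scores.
        print(word, freqs.freq(word))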

But for serious use, don't bother counting words yourself: if you're not interested in a specific domain, grab a large word-frequency table. There are gazillions out there (it's evidently the first thing a corpus creator thinks of), and the largest are probably the "1-gram" tables compiled by Google. You can download them at http://books.google.com/ngrams/datasets
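
If you go that route, here is a hedged sketch of summing the counts from one downloaded 1-gram file; the filename and the tab-separated ngram/year/match_count/volume_count layout are assumptions about the dataset version, so check the dataset page for the exact format:

    # Sketch: aggregate Google Books 1-gram counts over all years.
    # Filename and column layout are assumptions; verify them against
    # the format described on the ngrams dataset page.
    import gzip
    from collections import Counter

    counts = Counter()
    with gzip.open("googlebooks-eng-all-1gram-20120701-a.gz", "rt") as f:
        for line in f:
            ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
            counts[ngram.lower()] += int(match_count)  # sum over years

    print(counts["apple"])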