I'm trying to identify important terms in a set of government documents. Generating the term frequencies is no problem.
For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the Web.
My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the total number of occurrences of the term, which is what the Norvig data gives. Can I still use this data for a crude tf-idf operation?
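For reference, the usual formulation as I understand it is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}$$

where $N$ is the number of documents in the corpus and $\text{df}(t)$ is the number of documents containing $t$, whereas the Norvig data gives $P(t)$, the fraction of all Web tokens that are $t$.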
Here's some sample data:
| word    | tf     | global frequency (`gf`) |
|---------|--------|-------------------------|
| china   | 1684   | 0.000121447             |
| the     | 352385 | 0.022573582             |
| economy | 6602   | 0.0000451130774123      |
| and     | 160794 | 0.012681757             |
| iran    | 2779   | 0.0000231482902018      |
| romney  | 1159   | 0.000000678497795593    |
Simply dividing `tf` by `gf` gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?
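For concreteness, the whole computation is just this (a minimal sketch over the rows above):

```python
# Rows from the table above: word -> (tf in my documents, Norvig's global frequency).
data = {
    "china":   (1684,   0.000121447),
    "the":     (352385, 0.022573582),
    "economy": (6602,   0.0000451130774123),
    "and":     (160794, 0.012681757),
    "iran":    (2779,   0.0000231482902018),
    "romney":  (1159,   0.000000678497795593),
}

# Score each word: term frequency divided by global frequency.
scores = {word: tf / gf for word, (tf, gf) in data.items()}

for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8}  {score:,.0f}")
```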
> `gf` is in fact already inverse, right? So when you say dividing `tf` by `gf` you actually mean multiplying `tf` with `gf`, right? – jogojapan

> […] `gf`. That should give reasonable results then (although that division is of course unnecessary, because the only thing it does is to introduce a constant factor). And actually, dividing `tf` by `gf` from your table gives approx. 15,610,504 for 'the', but 146,343,374 for 'economy'. What's so bad about that? – jogojapan