
I'm trying to identify important terms in a set of government documents. Generating the term frequencies is no problem.

For document frequency, I was hoping to use the handy Python scripts and accompanying data that Peter Norvig posted for his chapter in "Beautiful Data", which include the frequencies of unigrams in a huge corpus of data from the Web.

My understanding of tf-idf, however, is that "document frequency" refers to the number of documents containing a term, not the number of total words that are this term, which is what we get from the Norvig script. Can I still use this data for a crude tf-idf operation?

Here's some sample data:

word    tf       global frequency
china   1684     0.000121447
the     352385   0.022573582
economy 6602     0.0000451130774123
and     160794   0.012681757
iran    2779     0.0000231482902018
romney  1159     0.000000678497795593 

Simply dividing tf by gf gives "the" a higher score than "economy," which can't be right. Is there some basic math I'm missing, perhaps?

Interesting question. For my understanding: what you refer to as gf is in fact already inverse, right? So when you say dividing tf by gf, you actually mean multiplying tf by gf, right? – jogojapan

I don't believe gf is inverse. "The" makes up 2.2 percent of all words in the giant corpus, while "and" is 1.2 percent and "china" is 0.012 percent. – Chris Wilson

Oh, so you have divided the global count by the total word count to obtain gf. That should give reasonable results then (although the division is unnecessary, since it only introduces a constant factor). And actually, dividing tf by gf from your table gives approx. 15,610,504 for 'the', but 146,343,374 for 'economy'. What's so bad about that? – jogojapan

1 Answer


As I understand it, your global frequency corresponds (as a reciprocal) to the "inverse total term frequency" discussed by Robertson. From Robertson's paper:

One possible way to get away from this problem would be to make a fairly radical replacement for IDF (that is, radical in principle, although it may be not so radical in terms of its practical effects). .... the probability from the event space of documents to the event space of term positions in the concatenated text of all the documents in the collection. Then we have a new measure, called here inverse total term frequency: ... On the whole, experiments with inverse total term frequency weights have tended to show that they are not as effective as IDF weights

According to this text, you can use the reciprocal of the global frequency as the IDF term, albeit a cruder one than the standard document-based IDF.
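A minimal sketch of how this could look in Python, using your sample numbers. Taking the log of the reciprocal global frequency is my own damping choice here (by analogy with standard IDF), not something prescribed by Robertson's paper:

```python
import math

# (term frequency in the document, global frequency from the web corpus)
# Values copied from the question's table.
stats = {
    "china":   (1684,   0.000121447),
    "the":     (352385, 0.022573582),
    "economy": (6602,   0.0000451130774123),
    "and":     (160794, 0.012681757),
    "iran":    (2779,   0.0000231482902018),
    "romney":  (1159,   0.000000678497795593),
}

def tf_idf_like(tf, gf):
    # Crude IDF stand-in: log of the reciprocal global frequency.
    return tf * math.log(1.0 / gf)

scores = {word: tf_idf_like(tf, gf) for word, (tf, gf) in stats.items()}
```

Note that "the" and "and" still score highest even with the log damping, simply because their raw counts are enormous; that is exactly why the stop-word removal mentioned next matters.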

Also, you are missing stop-word removal. Words such as "the" occur in almost every document and therefore carry no information. Before computing tf-idf, you should remove such stop words.
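A sketch of the stop-word filter; the small hand-made list below is an assumption for illustration (a real pipeline would use a fuller list, such as the one shipped with NLTK):

```python
# Tiny illustrative stop-word list; not exhaustive.
stop_words = {"the", "and", "a", "of", "to", "in"}

# Term frequencies from the question's table.
tf = {"china": 1684, "the": 352385, "economy": 6602,
      "and": 160794, "iran": 2779, "romney": 1159}

# Drop stop words before any tf-idf computation.
filtered = {word: count for word, count in tf.items()
            if word not in stop_words}
# Only the content words ("china", "economy", "iran", "romney") remain.
```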