4 votes

Average term frequency would be the average frequency with which a term appears in the other documents. Intuitively, I want to compare how frequently a term appears in this document relative to how frequently it appears in the other documents of the corpus.

An example:

  • d1 has the word "set" 100 times, d2 has the word "set" 1 time, d3 has the word "set" 1 time, and d4-dN do not have the word "set"
  • d1 has the word "theory" 100 times, d2 has the word "theory" 100 times, d3 has the word "theory" 100 times, and d4-dN do not have the word "theory"

Document 1 has the same tf-idf for the word "set" and the word "theory", even though "set" is more important to d1 than "theory".
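To see why, take the common weighting tf(t,d) * log(N / df(t)) (other tf-idf variants rescale the scores but do not break the tie): both words appear in exactly 3 documents and both occur 100 times in d1, so

    tf-idf("set", d1) = 100 * log(N / 3) = tf-idf("theory", d1)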

Using average term frequency would distinguish these two cases. Is tf-iatf (tf times inverse average term frequency) a valid approach? To me it seems it would surface more important keywords, rather than just "rare" and "unique" keywords. If idf is "an estimate of how rare that word is", wouldn't iatf be a better estimate? It seems only marginally harder to implement (especially if the data is pre-processed).
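Here is a rough sketch of what I have in mind on the toy corpus above (the corpus size N, the log form of iatf, and the handling of documents that lack the term are just my assumptions for illustration):

    import math
    from collections import Counter

    N = 1000  # assumed corpus size; d4..dN contain neither word
    docs = {
        "d1": Counter({"set": 100, "theory": 100}),
        "d2": Counter({"set": 1,   "theory": 100}),
        "d3": Counter({"set": 1,   "theory": 100}),
    }

    def df(term):
        # number of documents containing the term
        return sum(1 for c in docs.values() if c[term] > 0)

    def avg_tf(term, doc):
        # average count of the term over the other N-1 documents
        # (documents that lack the term contribute 0)
        return sum(c[term] for name, c in docs.items() if name != doc) / (N - 1)

    def tf_idf(term, doc):
        return docs[doc][term] * math.log(N / df(term))

    def tf_iatf(term, doc):
        # inverse *average* term frequency in place of idf
        return docs[doc][term] * math.log(1 / avg_tf(term, doc))

    for term in ("set", "theory"):
        print(term, round(tf_idf(term, "d1")), round(tf_iatf(term, "d1")))
    # tf-idf ties the two words in d1 (~581 each); tf-iatf scores "set" much
    # higher (~621 vs ~161) because its average frequency across the corpus is lower

In practice avg_tf would need smoothing so the log never sees a zero.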

I am thinking of running an experiment and manually analyzing the highest ranked keywords with each measure, but wanted to pass it by some other eyes first.

A follow-up question: why is tf-idf used so frequently, as opposed to alternative methods like this that may be more accurate (if this is a valid approach, that is)?

Update: I ran an experiment in which I manually analyzed the scores and corresponding top words for a few dozen documents, and it seems like iatf and inverse collection frequency (the standard approach closest to what I described) give very similar results.

1 Answer

3 votes

Tf-idf is not meant to compare the importance of a word in a document across two corpora. Rather, it distinguishes the importance of a word within a document relative to the distribution of the same term in the other documents of the same collection.

A standard approach you can apply in your case is to use collection frequency, cf(t), instead of document frequency, df(t).

cf(t) measures how many times a term t occurs in the corpus. cf(t) divided by the total collection size gives you the probability of sampling t from the collection.

You can then compute a linear combination of the document and collection probabilities, which gives you the probability of sampling a term t either from the document or from the collection:

P(t,d) = \lambda P(t|d) + (1-\lambda) P(t|Collection)

This is known as the Jelinek-Mercer smoothed language model.
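A minimal sketch of this score, assuming maximum-likelihood estimates for both probabilities (the parameter names here are only illustrative):

    def jm_score(tf_td, doc_len, cf_t, collection_len, lam=0.5):
        # Jelinek-Mercer smoothing: interpolate the document model
        # P(t|d) = tf(t,d) / |d| with the collection model
        # P(t|Collection) = cf(t) / |Collection|
        p_t_d = tf_td / doc_len
        p_t_c = cf_t / collection_len
        return lam * p_t_d + (1 - lam) * p_t_c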

For your example (letting \lambda=0.5):

Corpus 1: P("set",d1) = 0.5*100/100 + 0.5*100/102

Corpus 2: P("set",d1) = 0.5*100/100 + 0.5*100/300

Clearly, P("set", d1) for corpus 2 is lower than in corpus 1; its collection component is roughly one-third as large.
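Evaluating the two expressions above directly (just restating the same numbers):

    p1 = 0.5 * 100/100 + 0.5 * 100/102   # ~0.990 (corpus 1)
    p2 = 0.5 * 100/100 + 0.5 * 100/300   # ~0.667 (corpus 2)
    # the collection components (0.490 vs. 0.167) differ by roughly a factor of three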