Average term frequency would be the average number of times the term appears in the other documents. Intuitively, I want to compare how frequently a term appears in this document relative to the other documents in the corpus.
An example:
- d1 has the word "set" 100 times, d2 has the word "set" 1 time, d3 has the word "set" 1 time, and d4 through dN do not contain the word "set"
- d1 has the word "theory" 100 times, d2 has the word "theory" 100 times, d3 has the word "theory" 100 times, and d4 through dN do not contain the word "theory"
Both words occur 100 times in d1 and appear in exactly three documents, so document 1 has the same tf-idf for the word "set" and the word "theory", even though "set" is more important to d1 than "theory".
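To make the example concrete, here is a minimal sketch of the tf-idf calculation for d1. It assumes raw counts for tf, the plain `log(N / df)` form of idf, and a corpus size of N = 10 (the example above does not fix N); the names `counts` and `tf_idf` are just illustrative.

```python
import math

N = 10  # assumed corpus size; the example only says "d4 through dN"

# Raw counts per document (d1, d2, d3, then zeros for d4..dN).
counts = {
    "set":    [100, 1, 1] + [0] * (N - 3),
    "theory": [100, 100, 100] + [0] * (N - 3),
}

def tf_idf(term, doc_index):
    """Raw-count tf times log(N / df), one common tf-idf variant."""
    tf = counts[term][doc_index]
    df = sum(1 for c in counts[term] if c > 0)  # documents containing the term
    return tf * math.log(N / df)

score_set = tf_idf("set", 0)        # d1 is index 0
score_theory = tf_idf("theory", 0)

# Both terms have tf = 100 in d1 and df = 3, so the scores are identical.
print(score_set == score_theory)
```

idf only looks at whether a document contains the term, so the 1 vs. 100 occurrences in d2 and d3 never enter the score.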
Using average term frequency would distinguish these two examples. Is tf-iatf (inverse average term frequency) a valid approach? It seems to me that it would surface more important keywords, rather than just "rare" and "unique" ones. If idf is "an estimate of how rare that word is", wouldn't iatf be a better estimate? It seems only marginally harder to implement (especially if the data is pre-processed).
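As a rough sketch of what I mean, here is one possible way to write tf-iatf. The exact formula is an assumption on my part: the average is taken over the other N - 1 documents, and the `+ 1` terms are smoothing I added to avoid division by zero and negative weights. N = 10 is again an assumed corpus size, and `tf_iatf` is a hypothetical helper name.

```python
import math

N = 10  # assumed corpus size (not fixed by the example)

# Raw counts per document (d1, d2, d3, then zeros for d4..dN).
counts = {
    "set":    [100, 1, 1] + [0] * (N - 3),
    "theory": [100, 100, 100] + [0] * (N - 3),
}

def tf_iatf(term, doc_index):
    """Raw-count tf times a smoothed inverse average term frequency.

    The average term frequency in the other documents is
    others_total / (N - 1); its smoothed inverse is
    (N - 1) / (1 + others_total). This is an assumed definition.
    """
    tf = counts[term][doc_index]
    others_total = sum(counts[term]) - tf  # occurrences outside this document
    iatf = math.log(1 + (N - 1) / (1 + others_total))
    return tf * iatf

score_set = tf_iatf("set", 0)        # "set" occurs only twice elsewhere
score_theory = tf_iatf("theory", 0)  # "theory" occurs 200 times elsewhere

# Unlike tf-idf, this ranks "set" above "theory" for d1.
print(score_set > score_theory)
```

Other smoothing choices would change the numbers, but any weight that decreases with the count in the other documents would separate the two cases.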
I am thinking of running an experiment and manually analyzing the highest-ranked keywords under each measure, but wanted to run the idea by some other eyes first.
A follow-up question: why is tf-idf used so frequently instead of alternative methods like this, which MAY be more accurate (assuming this is a valid approach)?
Update: I ran an experiment in which I manually analyzed the scores and corresponding top words for a few dozen documents. It seems that iatf and inverse collection frequency (the standard approach to what I described) give very similar results.