6 votes

I'm trying to build a dictionary of words using tf-idf. However, the approach doesn't quite make sense to me intuitively.

If the inverse document frequency (idf) part of tf-idf measures the relevance of a term with respect to the entire corpus, then that implies that some important words will end up with a low score.

If we look at a corpus of legal documents, a term like "license" or "legal" might occur in every document. Due to idf, the score for these terms will be very low. However, intuitively speaking, these terms should have a higher score since they are clearly legal terms.
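To make that concrete, take the common idf variant (other variants add smoothing but behave similarly):

idf(t) = log(N / df(t))

where N is the number of documents and df(t) is the number of documents containing t. A term that appears in every document has df(t) = N, so idf(t) = log(1) = 0 and its tf-idf score is zero no matter how often it occurs.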

Is tf-idf a bad approach for building a dictionary of terms?


1 Answer

5 votes

Yes, those terms are legal terms. However, tf-idf doesn't try to evaluate whether a term is relevant to a specific domain; it measures how well a term separates the documents in your corpus from one another. If a term like "legal" occurs in every document, it won't help a classifier tell those documents apart, so it gets a low weight. However, if you mix your legal documents with a random set of other documents, you will find that these terms suddenly become highly relevant, precisely because they now allow you to tell the legal documents and the other documents apart.
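A quick sketch of that effect, assuming scikit-learn's TfidfVectorizer and a made-up toy corpus (the document lists and the helper idf_of below are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

legal_docs = [
    "the license grants legal rights",
    "legal counsel reviewed the license",
    "the legal terms of the license apply",
]
other_docs = [
    "the cat sat on the mat",
    "stock prices rose sharply today",
    "the recipe calls for two eggs",
]

def idf_of(term, docs):
    """Return the (smoothed) idf scikit-learn assigns to a term in a corpus."""
    vec = TfidfVectorizer()
    vec.fit(docs)
    return vec.idf_[vec.vocabulary_[term]]

# "legal" occurs in every legal document, so its idf sits at the minimum value.
print(idf_of("legal", legal_docs))

# Once non-legal documents are mixed in, "legal" discriminates between the two
# groups of documents and its idf (and hence its tf-idf weight) goes up.
print(idf_of("legal", legal_docs + other_docs))
```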

In practice, tf-idf weighting is more typically used to down-weight corpus-specific "kind-of" stop words: terms like "the" that occur in nearly every document and carry little meaning on their own.
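If that is the goal, one common trick (sketched here with scikit-learn's max_df parameter and another toy corpus, both just for illustration) is to drop terms whose document frequency is too high:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the license grants legal rights",
    "legal counsel reviewed the license",
    "the legal terms of the license apply",
]

# Ignore terms whose document frequency exceeds 80% of the corpus,
# i.e. treat them as corpus-specific stop words.
vec = TfidfVectorizer(max_df=0.8)
vec.fit(docs)

# "the", "legal" and "license" occur in every document and are filtered out.
print(sorted(vec.vocabulary_))
```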

Whether tf-idf is a good approach for building a dictionary depends very much on what you want to do with that dictionary afterward.