9
votes

I have searched the web for ways to normalize TF scores when document lengths differ widely (for example, when the documents range from 500 words to 2,500 words).

The only normalization I've found involves dividing the term frequency by the length of the document, which removes any effect of document length.

This method, though, is a really bad way of normalizing TF. If anything, it introduces a large bias into each document's TF scores (unless all documents are built from pretty much the same vocabulary, which is not the case when using TF-IDF).

For example, let's take two documents: one consisting of 100 unique words, and the other of 1,000 unique words. Each word in doc1 will have a TF of 0.01, while each word in doc2 will have a TF of 0.001.

This causes TF-IDF scores to automatically be larger when matching words in doc1 than in doc2.
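To make the bias concrete, here is a minimal Python sketch (assuming, hypothetically, that every word appears exactly once in its document, so counts are uniform):

```python
# Two toy documents: 100 vs. 1000 unique words, each appearing once.
# Length-normalized TF = count / document_length.
doc1_tf = {f"w{i}": 1 / 100 for i in range(100)}    # doc1: 100 words
doc2_tf = {f"w{i}": 1 / 1000 for i in range(1000)}  # doc2: 1000 words

# A term shared by both documents scores ten times higher in doc1,
# purely because doc1 is shorter.
print(doc1_tf["w0"])  # 0.01
print(doc2_tf["w0"])  # 0.001
```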

Does anyone have a suggestion for a more suitable normalization formula?

Thank you.

Edit: I also saw a method that divides each term frequency by the maximum term frequency within its own document, but this doesn't solve my problem either.

What I was thinking is to find the maximum term frequency across all the documents, and then normalize every term by dividing its term frequency by that global maximum.
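A rough sketch of that idea (the documents and counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical mini-corpus; each document is a list of tokens.
docs = [
    "apple apple banana".split(),
    "apple banana banana banana cherry".split(),
]
counts = [Counter(doc) for doc in docs]

# Global maximum term count over the whole collection
# (here: "banana" appears 3 times in the second document).
global_max = max(max(c.values()) for c in counts)

# Normalize every raw count by the single collection-wide maximum,
# instead of by each document's own length or own maximum.
normalized = [{term: n / global_max for term, n in c.items()} for c in counts]
```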

I would love to know what you think.


1 Answer

15
votes

What is the goal of your analysis?

If your end goal is to compare similarity between documents (and the like), you should not worry about document length at the TF-IDF calculation stage. Here is why.

TF-IDF represents your documents in a common vector space. If you then calculate the cosine similarity between these vectors, it compensates for the effect of differing document lengths, because cosine similarity evaluates the orientation of the vectors rather than their magnitude. I can illustrate the point with Python. Consider the following (toy) documents:

document1 = "apple apple banana"
document2 = "apple apple apple apple banana banana"

documents = (
    document1,
    document2)

The lengths of these documents differ, but their content is identical. More precisely, the relative distributions of terms in the two documents are identical, even though the absolute term frequencies are not.

Now, we use TF-IDF to represent these documents in a common vector space:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

Then we use cosine similarity to evaluate the similarity of these vectorized documents by looking only at their directions (or orientations), without regard to their magnitudes (that is, their lengths). Here, I evaluate the cosine similarity between document one and document two:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

The result is 1. Remember that the cosine similarity between two vectors equals 1 when they have exactly the same orientation, 0 when they are orthogonal, and -1 when they have opposite orientations.
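To see why magnitude drops out, here is cosine similarity computed by hand; the vectors below are illustrative term counts (not the actual TF-IDF values sklearn would produce):

```python
import math

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [2.0, 1.0]  # document1 term counts over (apple, banana)
w = [4.0, 2.0]  # document2 doubles every count: same direction, larger magnitude
# cos_sim(v, w) is (up to floating point) 1.0: scaling a vector
# does not change its orientation.
```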

In this case, you can see that the cosine similarity is not affected by the length of the documents and captures the fact that the relative distribution of terms in your original documents is identical! If you want to express this information as a "distance" between documents, then you can simply do:

1 - cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

This value tends to 0 when the documents are similar (regardless of their length) and to 1 when they are dissimilar.
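For instance, a hand-rolled version of this cosine distance (with made-up term-count vectors over a two-word vocabulary) behaves exactly as described:

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

same_direction = cosine_distance([2.0, 1.0], [4.0, 2.0])  # ~0: same term mix
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1: no shared terms
```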