I have searched the web for ways to normalize tf scores in cases where document lengths vary widely (for example, from 500 words to 2,500 words).
The only normalization I've found is dividing the term frequency by the length of the document, so document length ends up carrying no meaning at all.
This method, though, is a really bad way to normalize tf. If anything, it gives the tf scores of each document a very large bias (unless all documents are built from pretty much the same vocabulary, which is not the case when using tf-idf).
For example, take two documents: one consisting of 100 unique words and the other of 1,000 unique words. Each word in doc1 will have a tf of 0.01, while each word in doc2 will have a tf of 0.001.
This causes tf-idf scores to automatically be larger for words matched against doc1 than against doc2.
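To make the bias concrete, here is a toy reproduction in Python (all names here are placeholders I made up, not from any library):

```python
from collections import Counter

doc1 = [f"a{i}" for i in range(100)]    # 100 unique words, each appearing once
doc2 = [f"b{i}" for i in range(1000)]   # 1000 unique words, each appearing once

def length_normalized_tf(doc):
    """Divide each term's raw count by the document length."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}

tf1 = length_normalized_tf(doc1)
tf2 = length_normalized_tf(doc2)

print(tf1["a0"])   # 0.01
print(tf2["b0"])   # 0.001
# With equal idf, any matching word scores 10x higher in doc1 than in doc2.
```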
Does anyone have a suggestion for a more suitable normalization formula?
Thank you!
Edit: I also saw a method stating that we should divide the term frequency by the maximum term frequency in the document, for each document. This also doesn't solve my problem.
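For reference, this is what that per-document max normalization looks like as I understand it (a rough sketch, the function name is mine):

```python
from collections import Counter

def max_tf_normalized(doc):
    """Divide each term's count by the document's maximum term count."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

print(max_tf_normalized(["x", "y", "y"]))  # {'x': 0.5, 'y': 1.0}
# In the toy example above, every word occurs exactly once, so every term
# in both doc1 and doc2 gets a tf of 1.0 -- document length vanishes
# entirely instead of being weighted sensibly.
```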
What I was thinking is to find the maximum term frequency across all of the documents, and then normalize all of the terms by dividing each term frequency by that global maximum.
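A quick sketch of this idea (again, the names are mine): the maximum raw count is taken over the whole collection, and every document's counts are divided by that single value.

```python
from collections import Counter

def global_max_normalized(docs):
    """Normalize every term count by the largest term count in the corpus."""
    all_counts = [Counter(doc) for doc in docs]
    global_max = max(max(counts.values()) for counts in all_counts)
    return [{term: count / global_max for term, count in counts.items()}
            for counts in all_counts]

docs = [["a", "a", "b"], ["c", "d", "d", "d"]]
print(global_max_normalized(docs))
# [{'a': 0.666..., 'b': 0.333...}, {'c': 0.333..., 'd': 1.0}]
```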
Would love to know what you think.