I have found that the Okapi similarity measure can be used to calculate document similarity, from http://www2002.org/CDROM/refereed/643/node6.html and from this paper: http://singhal.info/ieee2001.pdf
I want to calculate the similarity between the documents of a document collection using the Okapi similarity scheme with Lucene.
E.g. I have 10 documents (doc #A, #B, #C, #D, etc.) in my document collection. I'll pick one document as the query document, say doc #A. Then, for each term 1..n of the query document, I'll calculate:
idfOfQueryTerm = log( (totalNumIndexedDocs - docFreq + 0.5) / (docFreq + 0.5) )
Then I'll take the sum of idfOfQueryTerm over the terms 1 to n:
idfOfQueryDoc = sum of (idfOfQueryTerm)
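A minimal sketch of this step in plain Python (not Lucene code), assuming a natural logarithm and that the document frequency of each query term has already been looked up in the index. The function names and the example frequencies are hypothetical.

```python
import math


def idf_of_query_term(total_num_indexed_docs, doc_freq):
    """IDF of one query term, with the fraction inside the logarithm."""
    return math.log((total_num_indexed_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_of_query_doc(total_num_indexed_docs, doc_freqs):
    """Sum of the per-term IDF values over all terms of the query document."""
    return sum(idf_of_query_term(total_num_indexed_docs, df) for df in doc_freqs)


# Hypothetical document frequencies for three query terms in a 10-document index.
# A term that occurs in more than half of the documents (here docFreq = 8)
# gets a negative value from this formula.
example_idf = idf_of_query_doc(10, [1, 3, 8])
```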
Then, for each of the 10 documents (including the query document), I'll calculate the document's total term frequency with this equation, using the query terms of the query document that was selected first:
tfOfDocument = (2.2 * termFrq) / (1.2 * (0.25 + 0.75 * docLength / this.avgDocLength) + termFrq)
So I'll end up with 10 tfOfDocument values, one for each document, and one idfOfQueryDoc value.
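A minimal sketch of the term-frequency step in plain Python (not Lucene code). It assumes that the constants 2.2, 1.2, 0.25 and 0.75 correspond to the usual Okapi parameters k1 = 1.2 and b = 0.75, and that the "total" for a document means summing the per-term value over the query terms; both the function names and that reading are assumptions.

```python
from collections import Counter

K1 = 1.2   # assumed Okapi k1, so that K1 + 1 = 2.2
B = 0.75   # assumed Okapi b, so that 1 - B = 0.25


def tf_of_term(term_frq, doc_length, avg_doc_length):
    """Length-normalised term-frequency value for one term in one document."""
    length_norm = K1 * ((1 - B) + B * doc_length / avg_doc_length)
    return (K1 + 1) * term_frq / (length_norm + term_frq)


def tf_of_document(query_terms, doc_tokens, avg_doc_length):
    """Sum of tf_of_term over the query terms (assumed meaning of 'total')."""
    counts = Counter(doc_tokens)  # a missing term counts as 0
    return sum(
        tf_of_term(counts[term], len(doc_tokens), avg_doc_length)
        for term in query_terms
    )
```

A query term that does not occur in a document contributes 0 to that document's value, so a document sharing no terms with the query document ends up with tfOfDocument = 0.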
Then I can calculate the similarity between the query document and each of the other documents using one of these two methods:
1) Similarity between query doc and doc #B = idfOfQueryDoc * tfOfDocument#B
2) Similarity between query doc and doc #B = idfOfQueryDoc * tfOfDocument#B * tfOfDocument#queryDoc
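The whole procedure, with both methods, can be sketched as follows in plain Python (not Lucene code). The four toy documents, the function names, the natural logarithm and the reading of tfOfDocument as a sum over the query terms are all assumptions made for illustration.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # assumed Okapi parameters behind 2.2, 1.2, 0.25 and 0.75


def idf_of_query_doc(query_terms, docs):
    """Sum over the query terms of log((N - docFreq + 0.5) / (docFreq + 0.5))."""
    n = len(docs)
    total = 0.0
    for term in query_terms:
        doc_freq = sum(1 for tokens in docs.values() if term in tokens)
        total += math.log((n - doc_freq + 0.5) / (doc_freq + 0.5))
    return total


def tf_of_document(query_terms, tokens, avg_doc_length):
    """Sum over the query terms of the length-normalised term frequency."""
    counts = Counter(tokens)
    norm = K1 * ((1 - B) + B * len(tokens) / avg_doc_length)
    return sum((K1 + 1) * counts[t] / (norm + counts[t]) for t in query_terms)


def similarities(query_id, docs):
    """Return {doc_id: (method 1 score, method 2 score)} for one query document."""
    query_terms = sorted(set(docs[query_id]))
    avg_doc_length = sum(len(t) for t in docs.values()) / len(docs)
    idf = idf_of_query_doc(query_terms, docs)
    tf_query = tf_of_document(query_terms, docs[query_id], avg_doc_length)
    scores = {}
    for doc_id, tokens in docs.items():
        tf_doc = tf_of_document(query_terms, tokens, avg_doc_length)
        scores[doc_id] = (idf * tf_doc, idf * tf_doc * tf_query)
    return scores


# Hypothetical toy collection; doc "A" is the query document.
docs = {
    "A": "okapi ranking lucene".split(),
    "B": "okapi lucene index".split(),
    "C": "java index search".split(),
    "D": "java search engine".split(),
}
scores = similarities("A", docs)
```

In this sketch tfOfDocument#queryDoc is the same positive factor for every candidate document, so method 2 rescales the scores of method 1 without changing their order.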
I want to know whether my understanding of the Okapi similarity measure is correct.
Which of the two methods above is optimal for calculating document similarity?