
I have found that the Okapi similarity measure can be used to calculate document similarity, from here http://www2002.org/CDROM/refereed/643/node6.html and from this paper http://singhal.info/ieee2001.pdf

I want to calculate the similarity between documents in a document collection using the Okapi similarity scheme with Lucene.

E.g. I have 10 documents (doc #A, #B, #C, #D, etc.) in my document collection. I'll pick one document as the query document, say doc #A. Then for each term 1..n of the query document I'll calculate:

idfOfQueryTerm = log((totalNumIndexedDocs - docFreq + 0.5) / (docFreq + 0.5))
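To make the idf step concrete, here is a minimal Python sketch. The function name `okapi_idf` is made up for illustration, and it assumes the log applies to the whole ratio and is the natural log (the base only rescales the scores).

```python
import math

def okapi_idf(total_docs: int, doc_freq: int) -> float:
    """Okapi idf: log((N - df + 0.5) / (df + 0.5)).

    Assumption: natural log, applied to the whole ratio.
    """
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# In a 10-document collection, rarer terms get a larger weight.
print(okapi_idf(10, 2))  # term appears in 2 of 10 documents
print(okapi_idf(10, 5))  # term appears in exactly half of the documents
print(okapi_idf(10, 8))  # term appears in most documents
```

Note that with this form of idf, a term that occurs in more than half of the documents gets a negative weight, which matters in a collection as small as 10 documents.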

Then I'll take the sum of idfOfQueryTerm over terms 1 to n: idfOfQueryDoc = sum of idfOfQueryTerm. Then, for each of the 10 documents (including the query doc), I'll calculate the term-frequency component of the document with this equation, based on the query terms of the query document that was selected first:

tfOfDocument = (2.2 * termFrq) / (1.2 * (0.25 + 0.75 * docLength / this.avgDocLength) + termFrq)
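The term-frequency component can be sketched the same way. The name `okapi_tf` and the parameters `k` and `b` are illustrative; the defaults are chosen so that the expression reduces to the constants 2.2, 1.2, 0.25 and 0.75 used above.

```python
def okapi_tf(term_freq: float, doc_len: float, avg_doc_len: float,
             k: float = 1.2, b: float = 0.75) -> float:
    """Length-normalised, saturating term frequency.

    With k = 1.2 and b = 0.75 this is
    2.2 * tf / (1.2 * (0.25 + 0.75 * doc_len / avg_doc_len) + tf).
    """
    norm = k * ((1 - b) + b * doc_len / avg_doc_len)
    return (k + 1) * term_freq / (norm + term_freq)

# Same term frequency, different document lengths.
print(okapi_tf(3, doc_len=100, avg_doc_len=100))  # average-length document
print(okapi_tf(3, doc_len=400, avg_doc_len=100))  # long document is penalised
print(okapi_tf(50, doc_len=100, avg_doc_len=100)) # approaches k + 1 = 2.2
```

The value grows with the term frequency but saturates below k + 1, and a document longer than average needs more occurrences to reach the same value.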

So I'll end up with 10 tfOfDocument values, one for each document, and one idfOfQueryDoc value.

Then I can calculate the similarity between the query document and the other documents using one of these two methods.

1) Similarity between query doc and doc #B = idfOfQueryDoc * tfOfDocument#B

2) Similarity between query doc and doc #B = idfOfQueryDoc * tfOfDocument#B * tfOfDocument#queryDoc

I want to know whether my understanding of the Okapi similarity measure is correct.

Which of the two methods above would be optimal for calculating document similarity?


1 Answer


Based on the first link, the similarity between the query document and another document is:

sim(query, doc) = sum(t in terms(query), freq(t, query) * w(t, doc))

where (from the second link, slightly modified as I think the formula in the link is incorrect)

w(t, doc) = idf(t) * (k+1)*freq(t, doc) / (k*(1-b + b*ls(doc)) + freq(t, doc))
ls(doc) = len(doc)/avgdoclen

and idf(t) is your idfOfQueryTerm, freq(t, doc) is the frequency of term t in document doc.

Choosing b = 0.75 and k = 1.2 (so that 1-b = 0.25) you get

w(t, doc) = idf(t) * 2.2*freq(t, doc) / (1.2*(0.25+0.75*ls(doc)) + freq(t, doc))

Note: the two links give slightly different equations, although the difference is mostly in the weighting, not the fundamentals.
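As a rough illustration of how the pieces fit together, here is a small Python sketch of sim(query, doc) over an in-memory collection. It is not Lucene code: the function name `okapi_similarity`, the toy documents, and the whitespace tokenisation are all assumptions made for the example, with k = 1.2 and 1-b = 0.25 as in the expanded formula.

```python
import math
from collections import Counter

def okapi_similarity(query_tokens, doc_tokens, collection, k=1.2, b=0.75):
    """sim(query, doc) = sum over query terms t of freq(t, query) * w(t, doc)."""
    n_docs = len(collection)
    avg_len = sum(len(d) for d in collection) / n_docs
    ls = len(doc_tokens) / avg_len              # ls(doc) = len(doc) / avgdoclen
    q_freq = Counter(query_tokens)
    d_freq = Counter(doc_tokens)
    score = 0.0
    for term, qf in q_freq.items():
        df = sum(1 for d in collection if term in d)   # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        f = d_freq[term]                               # freq(t, doc)
        w = idf * (k + 1) * f / (k * ((1 - b) + b * ls) + f)
        score += qf * w
    return score

# Toy collection (hypothetical data); doc A is used as the query document.
docs = {
    "A": "lucene index search".split(),
    "B": "lucene search ranking".split(),
    "C": "cooking pasta recipe".split(),
    "D": "garden soil plants".split(),
    "E": "travel train tickets".split(),
}
collection = list(docs.values())

for name, tokens in docs.items():
    print(name, okapi_similarity(docs["A"], tokens, collection))
```

The idf is applied per term inside the sum, rather than summed over the query first and then multiplied by a single tf value per document.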