I’m indexing a collection of documents with Lucene, storing term vectors (TermVector) at indexing time. I then read the index to retrieve the terms and their frequencies, and compute a TF-IDF score vector for each document. Finally, using the TF-IDF vectors, I calculate the pairwise cosine similarity between documents using the cosine similarity equation from Wikipedia.
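For reference, this is the standard form of that equation, written for two TF-IDF vectors of equal length. The symbols A, B and n are notation introduced here, not taken from my code, and the formula assumes that index i refers to the same term in both vectors:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
$$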
This is my problem: say I have two identical documents, “A” and “B”, in this collection (each has more than 200 sentences). If I calculate the pairwise cosine similarity between A and B, I get a cosine value of 1, which is exactly what I expect. But if I remove a single sentence from document “B”, the cosine similarity between the two documents drops to around 0.85. The documents are almost identical, but the cosine value does not reflect that. I assume the problem is with the equation that I’m using.
Is there a better way or equation that I can use for calculating cosine similarity between documents?
Edited
This is how I calculate cosine similarity. doc1[] and doc2[] are the TF-IDF vectors for the corresponding documents. Each vector contains only the scores, not the words.
private double cosineSimBetweenTwoDocs(float doc1[], float doc2[]) {
    double temp;
    int doc1Len = doc1.length;
    int doc2Len = doc2.length;
    float numerator = 0;
    float temSumDoc1 = 0;
    float temSumDoc2 = 0;
    double equlideanNormOfDoc1 = 0;
    double equlideanNormOfDoc2 = 0;
    if (doc1Len > doc2Len) {
        for (int i = 0; i < doc2Len; i++) {
            numerator += doc1[i] * doc2[i];
            temSumDoc1 += doc1[i] * doc1[i];
            temSumDoc2 += doc2[i] * doc2[i];
        }
        equlideanNormOfDoc1 = Math.sqrt(temSumDoc1);
        equlideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    } else {
        for (int i = 0; i < doc1Len; i++) {
            numerator += doc1[i] * doc2[i];
            temSumDoc1 += doc1[i] * doc1[i];
            temSumDoc2 += doc2[i] * doc2[i];
        }
        equlideanNormOfDoc1 = Math.sqrt(temSumDoc1);
        equlideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    }
    temp = numerator / (equlideanNormOfDoc1 * equlideanNormOfDoc2);
    return temp;
}
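One thing that may be worth ruling out: the method above pairs scores by array position and ignores everything past the length of the shorter array, so the result is only meaningful if index i refers to the same term in both arrays. Whether that holds here is not clear from the code alone. As a point of comparison, here is a minimal sketch that keys each score by its term instead. The class name TermKeyedCosine and the Map<String, Double> representation are assumptions made for illustration; they are not part of the original code or of Lucene's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: cosine similarity over term-keyed TF-IDF scores.
final class TermKeyedCosine {

    // Assumes each map holds term -> TF-IDF score for one document.
    static double cosine(Map<String, Double> doc1, Map<String, Double> doc2) {
        double dot = 0.0;
        // Only terms present in both documents contribute to the dot product.
        for (Map.Entry<String, Double> e : doc1.entrySet()) {
            Double other = doc2.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norm1 = norm(doc1);
        double norm2 = norm(doc2);
        // Guard against empty or all-zero vectors to avoid dividing by zero.
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0;
        }
        return dot / (norm1 * norm2);
    }

    // Euclidean norm over all of a document's terms, shared or not.
    static double norm(Map<String, Double> doc) {
        double sum = 0.0;
        for (double v : doc.values()) {
            sum += v * v;
        }
        return Math.sqrt(sum);
    }
}
```

With this layout, a term missing from one document contributes nothing to the dot product but still counts toward the norm of the document that contains it, which matches the equation when both vectors are defined over the same vocabulary.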