Better way of calculating document Similarity using Lucene

Question

I’m indexing a collection of documents using Lucene, specifying TermVector at indexing time. I then read the index to retrieve the terms and their frequencies, and calculate a TF-IDF score vector for each document. Finally, using the TF-IDF vectors, I calculate the pairwise cosine similarity between documents using Wikipedia's cosine similarity equation.

This is my problem: say I have two identical documents, “A” and “B”, in this collection (each has more than 200 sentences). If I calculate the pairwise cosine similarity between A and B, I get a cosine value of 1, which is expected. But if I remove a single sentence from document “B”, the cosine similarity between the two documents drops to around 0.85. The documents are almost identical, but the cosine value does not reflect that. I understand that the problem is with the equation I’m using.

Is there a better way or equation that I can use for calculating cosine similarity between documents?

Edited

This is how I calculate cosine similarity. doc1[] and doc2[] are the TF-IDF vectors for the corresponding documents. The vectors contain only the scores, not the words.

private double cosineSimBetweenTwoDocs(float doc1[], float doc2[]) {
    // Compare position by position, up to the length of the shorter vector.
    int len = Math.min(doc1.length, doc2.length);
    float numerator = 0;
    float temSumDoc1 = 0;
    float temSumDoc2 = 0;
    for (int i = 0; i < len; i++) {
        numerator += doc1[i] * doc2[i];
        temSumDoc1 += doc1[i] * doc1[i];
        temSumDoc2 += doc2[i] * doc2[i];
    }
    double equlideanNormOfDoc1 = Math.sqrt(temSumDoc1);
    double equlideanNormOfDoc2 = Math.sqrt(temSumDoc2);
    return numerator / (equlideanNormOfDoc1 * equlideanNormOfDoc2);
}
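
A positional routine like cosineSimBetweenTwoDocs only makes sense if index i refers to the same term in both arrays. One possible way to guarantee that, sketched here under the assumption that each document's scores are available as a term-to-score map (the class VectorAligner and its align method are hypothetical names, not part of Lucene), is to build both arrays over the sorted union of the two vocabularies and fill in 0 for missing terms:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class VectorAligner {
    // Returns two arrays of equal length; index i refers to the same term in both.
    // Terms missing from a document get a score of 0.
    static float[][] align(Map<String, Float> doc1, Map<String, Float> doc2) {
        SortedSet<String> vocabulary = new TreeSet<>(doc1.keySet());
        vocabulary.addAll(doc2.keySet());

        float[] v1 = new float[vocabulary.size()];
        float[] v2 = new float[vocabulary.size()];
        int i = 0;
        for (String term : vocabulary) {
            v1[i] = doc1.getOrDefault(term, 0f);
            v2[i] = doc2.getOrDefault(term, 0f);
            i++;
        }
        return new float[][] { v1, v2 };
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, Float> d1 = new HashMap<>();
        d1.put("apple", 1f);
        d1.put("cherry", 3f);
        Map<String, Float> d2 = new HashMap<>();
        d2.put("apple", 2f);
        d2.put("banana", 5f);

        float[][] aligned = align(d1, d2);
        System.out.println(java.util.Arrays.toString(aligned[0]));
        System.out.println(java.util.Arrays.toString(aligned[1]));
    }
}
```

The two aligned arrays always have the same length, so they could be passed to a positional cosine routine without truncation or shifting.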

I suspect something is wrong with your code. Removing one sentence out of 200 should give you a value > 0.98. To verify this, you can generate a random vector, modify it, and compute the cosine similarity between the two to see what you get. For a vector of size 1000 with random numbers in the range [10,100], if I subtract a random number in the range [10,20] from every element, the resulting similarity is always > 0.98 for me. - Helium
I used Mathematica to verify the case. Here is my code: a = RandomInteger[{10, 100}, 1000]; b = a - RandomInteger[{10, 20}, 1000]; {Total[a], Total[b], Total[a - b], N[(a.b)/(Norm[a] Norm[b])]}, and here is the output: {55419, 40271, 15148, 0.98811} - Helium
@Mohsen Removing one sentence from document B reduces the number of elements in its vector. If the vector has size 1000 before the sentence is removed, vector B shrinks to, say, 995 while vector A stays at 1000, and the two vectors are no longer aligned. Removing a sentence removes elements from the middle of the vector, not from the end. If you try removing elements from the middle of the vector, you will observe the 0.85 value. - Kasun
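
The two experiments described in these comments can be tried side by side. The following is only a rough Java sketch of them, not the commenters' own code: the class CosineCheck, its helper names, the fixed seed, and the choice to drop 5 entries at index 500 are all assumptions made for illustration. It compares a vector with (1) a copy whose values are perturbed in place and (2) a copy with a few entries removed from the middle, using a position-by-position cosine.

```java
import java.util.Random;

class CosineCheck {
    // Cosine similarity over the first min(x.length, y.length) positions,
    // i.e. a purely positional comparison.
    static double cosine(double[] x, double[] y) {
        int n = Math.min(x.length, y.length);
        double dot = 0, sumX = 0, sumY = 0;
        for (int i = 0; i < n; i++) {
            dot += x[i] * y[i];
            sumX += x[i] * x[i];
            sumY += y[i] * y[i];
        }
        return dot / (Math.sqrt(sumX) * Math.sqrt(sumY));
    }

    // Random integer-valued vector with entries in [lo, hi].
    static double[] randomVector(Random rnd, int size, int lo, int hi) {
        double[] v = new double[size];
        for (int i = 0; i < size; i++) {
            v[i] = lo + rnd.nextInt(hi - lo + 1);
        }
        return v;
    }

    // Copy of v with `count` entries dropped starting at index `from`,
    // so every later entry shifts left.
    static double[] dropRange(double[] v, int from, int count) {
        double[] out = new double[v.length - count];
        System.arraycopy(v, 0, out, 0, from);
        System.arraycopy(v, from + count, out, from, v.length - from - count);
        return out;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // arbitrary fixed seed
        double[] a = randomVector(rnd, 1000, 10, 100);

        // Case 1: same positions, each value reduced by a random amount in [10, 20].
        double[] perturbed = a.clone();
        for (int i = 0; i < perturbed.length; i++) {
            perturbed[i] -= 10 + rnd.nextInt(11);
        }

        // Case 2: values untouched, but 5 entries removed from the middle.
        double[] shifted = dropRange(a, 500, 5);

        System.out.println("perturbed in place: " + cosine(a, perturbed));
        System.out.println("entries removed:    " + cosine(a, shifted));
    }
}
```

With random data like this, the second value should come out noticeably lower than the first, which is consistent with the misalignment effect Kasun describes. The exact numbers depend on the seed and on the data, so they will not necessarily match 0.85.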

Helium · Accepted Answer · 2012-05-18T11:10:00

As I said in my comment, I think you made a mistake somewhere. The vectors should actually contain <word, frequency> pairs, not words only. Therefore, when you delete a sentence, only the frequencies of the words in that sentence decrease (each by the number of times the word occurs in it); the entries for the other words are not shifted. Consider the following example:

Document a:

A B C A A B C. D D E A B. D A B C B A.

Document b:

A B C A A B C. D A B C B A.

Vector a:

A:6, B:5, C:3, D:3, E:1

Vector b:

A:5, B:4, C:3, D:1, E:0

Which results in the following similarity measure:

(6*5+5*4+3*3+3*1+1*0)/(Sqrt(6^2+5^2+3^2+3^2+1^2) Sqrt(5^2+4^2+3^2+1^2+0^2))=
62/(8.94427*7.14143)=
0.970648

Edit: I think your source code does not work either. Consider the following code, which works fine with the above example:

import java.util.HashMap;
import java.util.Map;

public class DocumentVector {
    Map<String, Integer> wordMap = new HashMap<String, Integer>();

    public void incCount(String word) {
        Integer oldCount = wordMap.get(word);
        wordMap.put(word, oldCount == null ? 1 : oldCount + 1);
    }

    double getCosineSimilarityWith(DocumentVector otherVector) {
        double innerProduct = 0;
        for(String w: this.wordMap.keySet()) {
            innerProduct += this.getCount(w) * otherVector.getCount(w);
        }
        return innerProduct / (this.getNorm() * otherVector.getNorm());
    }

    double getNorm() {
        double sum = 0;
        for (Integer count : wordMap.values()) {
            sum += count * count;
        }
        return Math.sqrt(sum);
    }

    int getCount(String word) {
        return wordMap.containsKey(word) ? wordMap.get(word) : 0;
    }

    public static void main(String[] args) {
        String doc1 = "A B C A A B C. D D E A B. D A B C B A.";
        String doc2 = "A B C A A B C. D A B C B A.";

        DocumentVector v1 = new DocumentVector();
        for(String w:doc1.split("[^a-zA-Z]+")) {
            v1.incCount(w);
        }

        DocumentVector v2 = new DocumentVector();
        for(String w:doc2.split("[^a-zA-Z]+")) {
            v2.incCount(w);
        }

        System.out.println("Similarity = " + v1.getCosineSimilarityWith(v2));
    }

}
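
The class above uses raw word counts, while the question works with TF-IDF scores. The same term-keyed idea could be carried over to weighted vectors roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: SparseCosine and its methods are hypothetical names, the corpus is just the two example documents, and idf = 1 + ln(N / df) is one common smoothed variant that is not necessarily the weighting used in the question or by Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SparseCosine {
    // Term counts for one document, splitting on non-letters as in the example above.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.split("[^a-zA-Z]+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // TF-IDF weights keyed by term. Assumed weighting: tf * (1 + ln(N / df)).
    static Map<String, Double> tfIdf(Map<String, Integer> doc,
                                     List<Map<String, Integer>> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            int df = 0;
            for (Map<String, Integer> other : corpus) {
                if (other.containsKey(e.getKey())) {
                    df++;
                }
            }
            // Guard against df == 0 when doc is not part of the corpus.
            double idf = 1.0 + Math.log((double) corpus.size() / Math.max(df, 1));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    static double norm(Map<String, Double> v) {
        double sum = 0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity matched by term, so a missing term contributes 0
    // instead of shifting the remaining entries.
    static double cosine(Map<String, Double> x, Map<String, Double> y) {
        double dot = 0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            dot += e.getValue() * y.getOrDefault(e.getKey(), 0.0);
        }
        double nx = norm(x);
        double ny = norm(y);
        return (nx == 0 || ny == 0) ? 0 : dot / (nx * ny);
    }

    public static void main(String[] args) {
        Map<String, Integer> c1 = termCounts("A B C A A B C. D D E A B. D A B C B A.");
        Map<String, Integer> c2 = termCounts("A B C A A B C. D A B C B A.");
        List<Map<String, Integer>> corpus = Arrays.asList(c1, c2);

        System.out.println("TF-IDF similarity = "
                + cosine(tfIdf(c1, corpus), tfIdf(c2, corpus)));
    }
}
```

Keying the weights by term means a word that is absent from one document simply contributes nothing to the inner product, so removing a sentence cannot misalign the remaining entries.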
