1) Calculate tf-idf (generally better than tf alone, but it depends entirely on your data set and requirements).
From Wikipedia (regarding idf):
An inverse document frequency factor is incorporated which diminishes
the weight of terms that occur very frequently in the document set and
increases the weight of terms that occur rarely.
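To make the idf idea concrete, here is a minimal sketch computing plain idf, log(N / df), on a toy three-document corpus (the corpus and the helper name `idf` are made up for illustration):

```python
import math

# Toy corpus: "the" appears in every document, "galaxy" in only one.
docs = [["the", "cat"], ["the", "dog"], ["the", "galaxy"]]
n_docs = len(docs)

def idf(term):
    # Plain idf: log(N / df), where df = number of docs containing the term.
    df = sum(term in doc for doc in docs)
    return math.log(n_docs / df)

# "the" occurs in all 3 docs, so idf("the") = log(3/3) = 0 — fully down-weighted.
# "galaxy" occurs in 1 doc, so idf("galaxy") = log(3/1) ≈ 1.10 — boosted.
```

A term that appears in every document gets idf 0, so its tf-idf weight vanishes no matter how often it occurs; rare terms keep a large weight. (Libraries such as scikit-learn use a smoothed variant of this formula, so exact numbers differ.)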
2) No, it is not important that both documents have the same number of words.
3) You can compute tf-idf or cosine similarity in almost any language these days by calling a machine learning library function. I prefer Python.
Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.datasets import fetch_20newsgroups

# Example corpus: the 20 newsgroups data set bundled with scikit-learn.
example_data = fetch_20newsgroups(subset='all').data

max_features_for_tfidf = 10000
is_idf = True

# max_df=0.5 drops terms that appear in more than half the documents;
# min_df=2 drops terms that appear in fewer than two documents.
vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                             min_df=2, stop_words='english',
                             use_idf=is_idf)

# Sparse (n_documents, n_features) tf-idf matrix.
X_Mat = vectorizer.fit_transform(example_data)

# Pairwise cosine similarity between every pair of documents.
cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)
4) You might also be interested in truncated Singular Value Decomposition (SVD), which reduces the high-dimensional tf-idf matrix to a small dense representation (this is the idea behind latent semantic analysis).
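A minimal sketch of that idea, applying scikit-learn's `TruncatedSVD` to a tf-idf matrix before computing cosine similarity (the toy corpus and the choice of 2 components are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, chosen only to keep the example self-contained.
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse (3, n_features)

# Keep only the top components; n_components must be < n_features.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)              # dense (3, 2) matrix

# Cosine similarity in the reduced (latent) space.
sim = cosine_similarity(reduced)
```

Working in the reduced space is much cheaper for large vocabularies and can group documents that share related but not identical terms.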