Scipy, tf-idf and cosine similarity

Question

I'm trying to cluster some documents according to a tf-idf matrix using Python.

First I follow the Wikipedia definition of the formula, using normalised tf: http://en.wikipedia.org/wiki/Tf-idf

feat_vectors starts as a two-dimensional numpy array, with rows representing documents and columns representing terms; each cell holds the number of occurrences of that term in that document.

import numpy as np

feat_vectors /= np.max(feat_vectors,axis=1)[:,np.newaxis]
idf = len(feat_vectors) / (feat_vectors != 0).sum(0)
idf = np.log(idf)
feat_vectors *= idf

I then cluster these vectors using scipy:

from scipy.cluster import hierarchy

clusters = hierarchy.linkage(feat_vectors,method='complete',metric='cosine')
flat_clusters = hierarchy.fcluster(clusters, 0.8,'inconsistent')

However, on that last line it throws an error:

ValueError: Linkage 'Z' contains negative distances.

Cosine similarity goes from -1 to 1. However, the Wikipedia page for cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) states:

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative.

So if I am getting a negative similarity, it seems that I am making some error in calculating tf-idf. Any ideas what my mistake is?

Looks like your feat_vectors has negative values. Either before multiplying by idf, or idf has values lower than 1 before you take np.log. — tiago
The minimum value in the matrix is zero. It's just that the result of the cosine similarity is <0. — Fergusmac

Ben Allison · Accepted Answer · 2012-12-05T14:56:56

I suspect the error is in the following line:

idf = len(feat_vectors) / (feat_vectors != 0).sum(0)

since your logical vector is converted to ints in the sum and len returns an int, the division is an integer division (under Python 2) and you lose precision. Replacing it with:

idf = float(len(feat_vectors)) / (feat_vectors != 0).sum(0)

worked for me (i.e. produced what I was expecting with dummy data). Everything else looks correct.
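
As a rough illustration of the fix, here is a minimal sketch with made-up counts. The counts array and the names tf, df, tfidf, unit and cos_sim are my own assumptions, not taken from the question. It uses float division for idf and then checks the pairwise cosine similarities of the resulting non-negative rows. The similarity is computed directly with NumPy rather than through scipy, purely to keep the example self-contained.

```python
import numpy as np

# Hypothetical dummy data: rows are documents, columns are terms.
# A float dtype avoids truncation in the divisions below.
counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 1, 0],
    [1, 1, 0, 4],
], dtype=float)

# Normalised tf: divide each row by its largest count.
tf = counts / counts.max(axis=1)[:, np.newaxis]

# Document frequency: number of documents that contain each term.
df = (counts != 0).sum(axis=0)

# Float division, so N / df is not truncated to an integer.
idf = np.log(float(len(counts)) / df)

tfidf = tf * idf

# Pairwise cosine similarity: normalise each row to unit length,
# then take dot products between all pairs of rows.
unit = tfidf / np.linalg.norm(tfidf, axis=1)[:, np.newaxis]
cos_sim = unit.dot(unit.T)
```

With this data every tf-idf weight is non-negative, so the entries of cos_sim should lie between 0 and 1, and 1 - cos_sim (the cosine distance) should have no negative entries beyond floating-point rounding noise around zero.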
