I'm trying to cluster some documents according to a tf-idf matrix using Python.
First I follow the Wikipedia definition of the formula, using normalised tf: http://en.wikipedia.org/wiki/Tf-idf
feat_vectors starts as a two-dimensional numpy array, with rows representing documents and columns representing terms; each cell holds the number of occurrences of that term in that document.
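import numpy as np
# Normalise each row by its largest term count (normalised tf).
feat_vectors /= np.max(feat_vectors, axis=1)[:, np.newaxis]
# idf: log of (number of documents / number of documents containing the term).
idf = len(feat_vectors) / (feat_vectors != 0).sum(0)
idf = np.log(idf)
feat_vectors *= idf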
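For reference, here is a minimal self-contained sketch of the same steps on a made-up count matrix. The name `counts` and its values are hypothetical, and the explicit conversion to float is an assumption, since the dtype of feat_vectors is not stated above; with an integer dtype the in-place operations may not behave as intended.

```python
import numpy as np

# Hypothetical toy counts: 3 documents (rows) x 4 terms (columns).
counts = np.array([[2, 0, 1, 0],
                   [0, 3, 1, 0],
                   [1, 1, 1, 4]])

# Work on a float copy so the in-place division is not applied to integers.
tf = counts.astype(float)
tf /= tf.max(axis=1)[:, np.newaxis]    # normalise by each document's max count

doc_freq = (counts != 0).sum(axis=0)   # number of documents containing each term
idf = np.log(len(counts) / doc_freq)   # log(1) = 0 for a term present in every document

tfidf = tf * idf                       # broadcast idf across all rows
```

With non-negative counts and every term present in at least one document, every entry of `tfidf` in this sketch is non-negative. A term that occurs in every document gets a weight of zero.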
I then cluster these vectors using scipy:
from scipy.cluster import hierarchy
clusters = hierarchy.linkage(feat_vectors, method='complete', metric='cosine')
flat_clusters = hierarchy.fcluster(clusters, 0.8, 'inconsistent')
However, on that last line it throws an error:
ValueError: Linkage 'Z' contains negative distances.
Cosine similarity ranges from -1 to 1. However, the Wikipedia page for cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) states:
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative.
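The quoted claim can be checked numerically with a small sketch. The helper `cosine_similarity` and the vectors below are hypothetical illustrations, not part of my actual code. Note that scipy's 'cosine' metric is a distance, defined as 1 minus this similarity.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical non-negative weight vectors.
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 3.0, 0.0])   # shares no terms with a
c = np.array([2.0, 0.0, 4.0])   # points in the same direction as a

sim_ab = cosine_similarity(a, b)   # orthogonal vectors
sim_ac = cosine_similarity(a, c)   # parallel vectors

# A single negative component is enough to push the similarity below zero.
d = np.array([-1.0, 0.0, 0.0])
sim_ad = cosine_similarity(a, d)
```

For the non-negative vectors the similarity stays between 0 and 1, up to floating-point rounding; only the vector with a negative component yields a negative value.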
So if I am getting a negative similarity, it seems that I am making some error in calculating tf-idf. Any ideas what my mistake is?
feat_vectors has negative values. Either before multiplying by idf, or idf has values lower than 1 before you take np.log. – tiago
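The checks suggested in this comment could be sketched roughly as follows. `report_matrix_problems` is a hypothetical helper, not a numpy or scipy function, and the NaN and inf checks are an added assumption beyond what the comment mentions. It could be called on feat_vectors and on idf at each step of the computation.

```python
import numpy as np

def report_matrix_problems(name, arr):
    """Return a list of problems found in arr (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    problems = []
    if np.isnan(arr).any():
        problems.append(f"{name} contains NaN")
    if np.isinf(arr).any():
        problems.append(f"{name} contains inf")
    if (arr < 0).any():
        problems.append(f"{name} contains negative values")
    return problems

# Made-up matrices to show the helper's output.
good = np.array([[0.0, 0.5], [1.0, 0.0]])
bad = np.array([[0.0, -0.5], [np.nan, 0.0]])

good_report = report_matrix_problems("feat_vectors", good)
bad_report = report_matrix_problems("feat_vectors", bad)
```

An empty list means none of the three conditions was found; it does not by itself rule out other causes of the error.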