I found a Python tutorial on the web for calculating tf-idf and cosine similarity, and I am trying to adapt it and change it a bit.
The problem is that I get weird results that make almost no sense.
For example, I am using 3 documents, [doc1, doc2, doc3], where doc1 and doc2 are similar and doc3 is totally different.
The results are:
[[  0.00000000e+00   2.20351188e-01   9.04357868e-01]
 [  2.20351188e-01  -2.22044605e-16   8.82546765e-01]
 [  9.04357868e-01   8.82546765e-01  -2.22044605e-16]]
First, I thought the numbers on the main diagonal should be 1, not 0. Second, the similarity score for doc1 and doc2 is around 0.22 while doc1 and doc3 score around 0.90, and I expected the opposite. Could you please check my code and help me understand why I get these results?
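One thing I did notice while double-checking: nltk.cluster.util.cosine_distance seems to return a distance (1 minus the cosine similarity) rather than the similarity itself, which would at least explain the zeros (and the tiny -2.22e-16 floating-point residue) on the diagonal. A quick sanity check:

import nltk.cluster.util

# identical vectors: similarity 1, so the distance should be 0
print nltk.cluster.util.cosine_distance([1, 0], [1, 0])   # 0.0
# orthogonal vectors: similarity 0, so the distance should be 1
print nltk.cluster.util.cosine_distance([1, 0], [0, 1])   # 1.0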
doc1, doc2 and doc3 are tokenized texts (each one a list of word tokens).
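My real documents are longer, but here are hypothetical toy stand-ins with the same relationship (doc1 and doc2 similar, doc3 unrelated), so the snippet below is self-contained:

# hypothetical stand-ins for the real, longer documents
doc1 = ['the', 'cat', 'sat', 'on', 'the', 'mat']
doc2 = ['the', 'cat', 'lay', 'on', 'the', 'rug']
doc3 = ['stock', 'prices', 'fell', 'sharply', 'on', 'monday']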
import math
import numpy
import nltk.cluster.util

articles = [doc1, doc2, doc3]

# flatten all documents into one list of tokens
# (note: this keeps duplicates, so corpus is a token list, not a vocabulary)
corpus = []
for article in articles:
    for word in article:
        corpus.append(word)
# raw count of a word inside one document
def freq(word, article):
    return article.count(word)

# total number of tokens in a document
def wordCount(article):
    return len(article)

# number of documents that contain the word at least once
def numDocsContaining(word, articles):
    count = 0
    for article in articles:
        if word in article:
            count += 1
    return count

# term frequency, normalised by document length
def tf(word, article):
    return freq(word, article) / float(wordCount(article))

# inverse document frequency; the +1 in the denominator avoids division by zero
def idf(word, articles):
    return math.log(len(articles) / (1 + numDocsContaining(word, articles)))

def tfidf(word, document, documentList):
    return tf(word, document) * idf(word, documentList)
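To convince myself the helpers behave sensibly I tried them on the toy documents above, and already hit something odd (I am on Python 2, so / between two ints floors):

print tf('cat', doc1)        # 1/6 ~ 0.167, as expected
print idf('sat', articles)   # 0.0, but I expected log(3/2) ~ 0.405;
                             # 3 / (1 + 1) floors to 1 under Python 2
print idf('on', articles)    # ValueError: math domain error,
                             # since 3 / (1 + 3) floors to 0 and log(0) blows up

I am also unsure about the call below: I pass corpus (a flat token list) as documentList, not articles, so inside idf, len(articles) is the total number of tokens and numDocsContaining does a substring test on each token instead of a membership test on each document. Could that be part of the problem?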
# one feature slot per token occurrence in corpus (duplicates included)
feature_vectors = []
for article in articles:
    vec = []
    for word in corpus:
        if word in article:
            vec.append(tfidf(word, article, corpus))
        else:
            vec.append(0)
    feature_vectors.append(vec)
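A quick shape check (with the toy documents corpus holds 18 tokens, duplicates included, so every vector gets 18 slots rather than one per distinct word):

print len(corpus), [len(v) for v in feature_vectors]   # 18 [18, 18, 18]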
n = len(articles)
mat = numpy.empty((n, n))
for i in xrange(0, n):
    for j in xrange(0, n):
        # pairwise cosine *distance* between the tf-idf vectors
        mat[i, j] = nltk.cluster.util.cosine_distance(feature_vectors[i], feature_vectors[j])
print mat
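As a cross-check I also computed the cosine similarity (rather than the distance) directly with numpy; here the diagonal should come out as 1 up to floating-point error:

# similarity = dot(u, v) / (|u| * |v|), assuming no all-zero rows
fv = numpy.array(feature_vectors)
norms = numpy.sqrt((fv ** 2).sum(axis=1))
print numpy.dot(fv, fv.T) / numpy.outer(norms, norms)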