
Hi, I'm trying to calculate the cosine similarity between my query and the documents returned by my information retrieval program in Python.

For the cosine similarity I use this implementation:

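import math

def cosine_similarity(v1, v2):
    # Accumulate the squared norms and the dot product in a single pass.
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    # cos = (v1 . v2) / (|v1| * |v2|)
    return sumxy/math.sqrt(sumxx*sumyy)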

I found this solution on this website, but I'm having some problems. I have tf*idf weights and a vector for each document. This is an example of a document vector and a query vector:

D: [0.028239449664633154, 0.05559373180364792, 0.02798439181455718]
Q: [0.3746433655507998, 0.526816791853616, 0.618765996788542] 

OK, so the problem is that sometimes when I compute the cosine similarity, the result is bigger than 1. How is this possible? A cosine can't be bigger than 1, can it? Is my reasoning correct? Is it correct to use cosine similarity in this case? Please help me, thanks.

What input gives you a result greater than 1? - jwodder
D: [0.009063952392358061, 0.01055107112621112] Q: [0.5619650483261998, 0.6541664098250894] - Dancing Flowerz
But it gives me 1.0000000000000002 as the result, and there are documents with higher weights that get a lower similarity. - Dancing Flowerz
OK, but if my query is [draw, paint], with this method I get a high similarity for documents where these terms appear once, and a low similarity for documents where they appear 20 times. - Dancing Flowerz

1 Answer


1) Cosine similarity can't be greater than 1.

-1 <= cos_sim <= 1

2) You are probably getting a result slightly greater than 1 because of rounding error in the float data type.

Floating-point numbers are represented in computer hardware as base 2 (binary) fractions.

On a typical machine running Python, there are 53 bits of precision available for a Python float.

If Python were to print the true decimal value of the binary approximation stored for 0.1, it would have to display:

>>> 0.1
0.1000000000000000055511151231257827021181583404541015625
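
One way to see that stored value yourself, as a small sketch using only the standard library's decimal module (the variable name exact is just for illustration):

```python
from decimal import Decimal

# Decimal(float) converts the stored binary value exactly, without rounding,
# so it exposes the digits that the interactive prompt normally hides.
exact = Decimal(0.1)
print(exact)

# repr() shows the shortest string that round-trips to the same float.
print(repr(0.1))
```

The first print shows the long expansion above, while the second shows the familiar 0.1.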

See the "Floating Point Arithmetic: Issues and Limitations" chapter of the Python tutorial to understand more about floating-point numbers in Python.
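
If the overshoot is a problem in practice, one common workaround is to clamp the result to the valid range. The following is only a sketch: the name cosine_similarity_clamped and the choice to return 0.0 for an all-zero vector are my assumptions, not something from the question.

```python
import math

def cosine_similarity_clamped(v1, v2):
    """Cosine similarity with the result clamped to [-1.0, 1.0]."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector has no direction, so report 0.0
        # instead of dividing by zero.
        return 0.0
    raw = dot / (norm1 * norm2)
    # Rounding can push raw a hair outside [-1, 1]; clamp it back.
    return max(-1.0, min(1.0, raw))

# Vectors taken from the comments on the question.
d = [0.009063952392358061, 0.01055107112621112]
q = [0.5619650483261998, 0.6541664098250894]
sim = cosine_similarity_clamped(d, q)
print(sim)
```

Those two vectors point in almost exactly the same direction (Q is roughly 62 times D), so a similarity of about 1 is expected. Cosine similarity measures only the angle between the vectors and ignores their length, so a document with larger weights does not automatically score higher.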