0
votes

I have three text documents represented as vectors in a vector space model (VSM). The entries are the tf-idf weights of terms, and the vectors have different numbers of entries.

Q1: How does k-means use cosine similarity, and how are the clusters then constructed?

Q2: When I use the TF-IDF algorithm, it produces negative values. Is there a problem in my calculation?

Please use the following document vectors in VSM (tf-idf), which all have different lengths, for explanation purposes.

Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)

I would be grateful to anyone who can explain this.

1
I'm voting to close this question as off-topic because this question appears rooted in mathematics rather than programming. This question might be on topic on some other math related SE sites such as MathOverflow or Mathematics, though do your own research for topicality before posting there. – HPierce

1 Answer

0
votes

Cosine similarity means you compare each vector with a k-means centre using their dot product rather than the Euclidean distance.

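The dot product is a.x*b.x + a.y*b.y + ... + a.zz*b.zz, summed over all the dimensions. You generally normalize the vectors first, so that the dot product equals the cosine of the angle between them. Then call acos() on the result to get the angle itself.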
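As a rough illustration of the calculation described above, here is a minimal Python sketch. The function names (dot, cosine_similarity, angle) are made up for this example, and it assumes both vectors have the same number of entries, which the vectors in the question do not; they would first need to be expressed over a common vocabulary.

```python
import math

def dot(a, b):
    # Sum of the products of matching entries: a[0]*b[0] + a[1]*b[1] + ...
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of entries")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dividing by both lengths is the same as normalizing the vectors first.
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

def angle(a, b):
    # Clamp to [-1, 1] so rounding error cannot push acos() out of its domain.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(cos)
```

Two vectors that point the same way, such as (1, 2) and (2, 4), get a similarity of about 1 whatever their magnitudes, while perpendicular vectors get 0.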

Essentially you're dividing the results into sectors rather than into randomly-clumped clusters.
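One way to picture the "sectors" idea is the sketch below, which is one common variant (often called spherical k-means) and not necessarily what any particular library does. The name cosine_kmeans and the choice to pass the starting centres in by hand are assumptions made for illustration; every vector is assumed to have the same number of entries.

```python
import math

def normalize(v):
    # Scale a vector to length 1; a zero vector is returned unchanged.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def assign(vectors, centres):
    # Give each vector the index of the centre with the largest dot product,
    # which is the largest cosine similarity because everything has length 1.
    labels = []
    for v in vectors:
        sims = [sum(x * y for x, y in zip(v, c)) for c in centres]
        labels.append(sims.index(max(sims)))
    return labels

def cosine_kmeans(vectors, centres, iterations=10):
    vectors = [normalize(v) for v in vectors]
    centres = [normalize(c) for c in centres]
    labels = assign(vectors, centres)
    for _ in range(iterations):
        new_centres = []
        for k, old in enumerate(centres):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                # New centre: the mean direction of the members, renormalized.
                mean = [sum(col) / len(members) for col in zip(*members)]
                new_centres.append(normalize(mean))
            else:
                new_centres.append(old)  # keep an empty cluster's centre
        centres = new_centres
        new_labels = assign(vectors, centres)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centres
```

Because only direction matters here, (1, 0.1) and (10, 1) always land in the same cluster, which is what makes the clusters look like sectors around the origin.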