How is cosine similarity used with the K-means algorithm?

Question

I have three text documents represented as vectors of different lengths in a vector space model (VSM), where each entry is the tf-idf weight of a term:

Q1: How does k-means use cosine similarity, and how are the clusters then constructed?

Q2: When I compute TF-IDF, it produces negative values. Is there a problem in my calculation?

Please use the following document vectors in the VSM (tf-idf), which all have different lengths, for the explanation.

Doc1 (0.134636045, -0.000281926, -0.000281926, -0.000281926, -0.000281926, 0)
Doc2 (-0.002354898, 0.012411358, 0.012411358, 0.09621575, 0.3815553)
Doc3 (-0.001838258, 0.009688438, 0.019376876, 0.05633028, 0.59569238, 0.103366223, 0)

I would be grateful to anyone who can explain this.

I'm voting to close this question as off-topic because this question appears rooted in mathematics rather than programming. This question might be on topic on some other math related SE sites such as MathOverflow or Mathematics, though do your own research for topicality before posting there. — HPierce

Malcolm McLean · Accepted Answer · 2017-02-07T17:48:10

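Using cosine similarity means you compare each vector with each k-means centre by their dot product rather than by Euclidean distance.

The dot product is a.x*b.x + a.y*b.y + ... + a.zz*b.zz, summed over all the dimensions. You generally normalize the vectors to unit length first, so that the dot product equals the cosine of the angle between them. Then call acos() on the result if you want the angle itself.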
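As a rough sketch of that calculation in Python: the function names `cosine_similarity` and `angle_between` are my own, not from any library, and the code assumes both vectors have the same number of dimensions, since a dot product is only defined in that case.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Dot product: sum of the products of matching components.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Dividing by both lengths is the same as normalizing the vectors first.
    return dot / (norm_a * norm_b)


def angle_between(a, b):
    """Return the angle in radians between two vectors."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos().
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(c)
```

Vectors pointing the same way give a similarity of 1 and an angle of 0, and perpendicular vectors give a similarity of 0. For ranking which centre is closest you can skip acos() entirely, because a larger cosine always means a smaller angle.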

Essentially you're dividing the results into sectors rather than into randomly-clumped clusters.
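To make the "sectors" idea concrete, here is a minimal sketch of one common way to do it, usually called spherical k-means. That choice is my assumption, since nothing above names a specific variant. The helper names and the starting centres are hypothetical, and all vectors are assumed to have the same number of dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points, init_centres, iterations=20):
    """Cluster points by direction; returns (labels, unit-length centres)."""
    unit = [normalize(p) for p in points]
    centres = [normalize(c) for c in init_centres]
    labels = [0] * len(unit)
    for _ in range(iterations):
        # Assignment step: each point joins the centre with the highest
        # cosine similarity (a plain dot product, as everything is unit length).
        labels = [
            max(range(len(centres)), key=lambda k: dot(p, centres[k]))
            for p in unit
        ]
        # Update step: average the members, then project back to unit length.
        new_centres = []
        for k, old in enumerate(centres):
            members = [p for p, lab in zip(unit, labels) if lab == k]
            if not members:
                new_centres.append(old)  # empty cluster: keep the old centre
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            if all(x == 0.0 for x in mean):
                new_centres.append(old)  # members cancel out: keep old centre
            else:
                new_centres.append(normalize(mean))
        if new_centres == centres:
            break  # centres stopped moving
        centres = new_centres
    return labels, centres
```

With made-up 2-D data such as `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]` and starting centres `[[1, 0], [0, 1]]`, the first two points end up in one cluster and the last two in the other. Each cluster then covers a range of directions around its centre, regardless of how long the original vectors were.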
