Am I clustering users correctly by using sklearn's cosine similarity method and the K-means algorithm?

Question

I have a movie dataset with more than 200 movies and more than 100 users. Each user rated movies with a value of 1 for good or 0 for bad, and left the cell blank if they made no choice.

I want to cluster similar users based on their reviews, with the idea that users who rated similar movies as good might also rate as good a movie that no user in the same cluster has rated yet. I used the cosine similarity measure with k-means clustering. The CSV file is shown below:

  UserID         M1     M2       M3  ...............  M200                          
  user1          1      0                               0     
  user2          0      1        1                                      
  user3          0      1                               1                                                                         
    .
    .
    .
    .
 user100         1      0        1

I then recoded the ratings with the following scheme: if the original review was 1 (good), the cell gets 1; if the review was 0 (bad), the cell gets -1; and if there was no review, the cell gets 0. The CSV file below shows the result. The rows are users; in the column names, M is the movie and C is the choice.

 UserID      M1C1   M2C1  M3C1 .  . ..............M200C1                            
  user1       1     -1     0                        -1    
  user2      -1      1     1                         0   
  user3      -1      1     0                         1                                                                       
    .
    .
    .
 user100      1     -1     1                         0
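
As a minimal sketch, the recoding described above could be done in pandas as follows. The tiny inline CSV and the names `raw` and `encoded` are made up for illustration and are not the real file:

```python
import io

import pandas as pd

# Hypothetical miniature of the raw ratings file:
# 1 = good, 0 = bad, blank = not rated.
raw_csv = io.StringIO(
    "UserID,M1,M2,M3\n"
    "user1,1,0,\n"
    "user2,0,1,1\n"
    "user3,0,1,\n"
)
raw = pd.read_csv(raw_csv).set_index("UserID")

# Map 1 -> 1 and 0 -> -1 with (2 * x - 1); blanks are NaN and become 0.
encoded = (raw * 2 - 1).fillna(0).astype(int)
print(encoded)
```

Doing the arithmetic before `fillna` matters here: filling blanks with 0 first would make "not rated" indistinguishable from "bad".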

I measured cosine similarity and then clustered the users with sklearn's cosine_similarity and kmeans clustering algorithm. The code is:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

df = pd.read_csv('input_file.csv', sep=',', encoding='latin-1',  
                  index_col=False)

df = df.set_index('UserID')


pairwise = pd.DataFrame(
         cosine_similarity(df.values),
         columns = df.index.values,
         index = df.index
)

print (pairwise.round(2))

pairs = pairwise.unstack()

pairs.index.rename(['User A', 'User B'], inplace=True)
pairs = pairs.to_frame('cosine distance').reset_index()

A = pairs[
      (pairs['cosine distance'] < 0.00) 
      & (pairs['User A'] != pairs['User B'])
]

print(A)

kmeans = KMeans(n_clusters=2, init ='k-means++', max_iter=50, 
                n_init=5,random_state=0 )
y_kmeans = kmeans.fit_predict(pairwise)

print(y_kmeans)

frame = pd.DataFrame(pairwise)
frame['cluster'] = y_kmeans

print(frame['cluster'])
print(frame['cluster'].value_counts())

With this code, I get the cosine similarity for all the pairs, and I can filter the pairs based on the value of the cosine similarity. I also get a cluster label for each user. What I want to know is: am I doing this right? Is it correct to calculate the cosine similarity first and then pass those values to k-means, given that sklearn's KMeans uses Euclidean distance by default?

I will really appreciate some help.

Thanks..

Dorian Dorian · Accepted Answer · 2020-07-21T13:29:41

The cosine similarity in sklearn is defined as the dot product of two vectors divided by the product of their lengths (Euclidean norms).
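
To make that definition concrete, here is a small sketch that computes the value by hand and compares it with `sklearn.metrics.pairwise.cosine_similarity`. The two rating vectors `u` and `v` are invented for the example:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rating vectors in the +1 / -1 / 0 encoding.
u = np.array([1, -1, 0, -1])
v = np.array([-1, 1, 1, 0])

# Dot product divided by the product of the two vector lengths.
manual = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# sklearn expects 2-D inputs (one row per sample).
from_sklearn = cosine_similarity(u.reshape(1, -1), v.reshape(1, -1))[0, 0]

print(manual, from_sklearn)
```

Both numbers should agree: the dot product is -2 and each vector has length sqrt(3), so the similarity is -2/3.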

You want to compare two vectors with each other, where each vector describes one user's ratings of the films. So first, you need to get rid of the "second" choice. You should have a value of +1 if the user liked the film, -1 if they didn't, and 0 if they didn't rate it.
You need the pairwise similarities of the ratings: user1 to user2, user1 to user3, ...
Try to get an understanding of what those values mean...

(Hint: if two people are very similar, you have a lot of 1x1, that you sum up and normalize in the end. So the more similar they are, the closer to 1 is the cosine similarity. Vice versa for the negative case where you have [1, 1, 1, ...] and [-1, -1, -1, ...], which results in values closer to -1 if they are very unlike.)
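
The hint can be checked on a toy example. The four users below are invented, and the only library call used is `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical users in the +1 / -1 / 0 encoding.
ratings = np.array([
    [ 1,  1,  1,  1],   # user A
    [ 1,  1,  1,  1],   # user B: agrees with A on every film
    [-1, -1, -1, -1],   # user C: disagrees with A on every film
    [ 1,  1, -1, -1],   # user D: agrees with A on half of the films
])

# Rows and columns of `sim` follow the row order of `ratings`.
sim = cosine_similarity(ratings)
print(sim.round(2))
```

Here A vs. B gives 1, A vs. C gives -1, and A vs. D gives 0, because the agreements and disagreements cancel out in the dot product.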

Now cluster (?)! Keep in mind that these values are similarities, not distances: values close to -1 mean two users are dissimilar, and values close to 1 mean they are similar.
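
One possible way to do the clustering step, which is my own suggestion and not the only option, is to scale every user's rating vector to unit length and run KMeans on those vectors rather than on the similarity matrix. For unit vectors, the squared Euclidean distance equals 2 - 2 * (cosine similarity), so the Euclidean distance KMeans uses ranks pairs the same way cosine similarity does. The data below is invented, and `normalize` comes from `sklearn.preprocessing`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical ratings (+1 good, -1 bad, 0 not rated) with two obvious groups.
ratings = np.array([
    [ 1,  1, -1, -1,  0],
    [ 1,  1, -1,  0, -1],
    [ 1,  0, -1, -1, -1],
    [-1, -1,  1,  1,  0],
    [-1, -1,  1,  0,  1],
    [-1,  0,  1,  1,  1],
], dtype=float)

# Scale each row to unit L2 length; all-zero rows (no ratings) stay zero.
unit = normalize(ratings)

# Euclidean k-means on unit vectors approximates clustering by cosine similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
print(labels)
```

The first three users should land in one cluster and the last three in the other; which cluster gets label 0 or 1 is arbitrary.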
