I have a movie dataset with more than 200 movies and more than 100 users. Each user rated the movies: 1 for good, 0 for bad, and blank if the user made no choice.
I want to cluster similar users based on their reviews, the idea being that users who rated similar movies as good might also rate as good a movie that has not been rated by any user in the same cluster. I used the cosine similarity measure with k-means clustering. The CSV file looks like this:
UserID M1 M2 M3 ............... M200
user1 1 0 0
user2 0 1 1
user3 0 1 1
.
.
.
.
user100 1 0 1
Following this scheme, I put 1 in the cell if the original review was 1 (good) and -1 if it was 0 (bad). Where there is no review, I put 0. The CSV file below illustrates the scheme: the rows are users, and in each column name M is the movie and C is the choice.
UserID M1C1 M2C1 M3C1 . . ..............M200C1
user1 1 -1 0 -1
user2 -1 1 1 0
user3 -1 1 0 1
.
.
.
user100 1 -1 1 0
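The recoding described above can be sketched with pandas roughly as follows. This is only a minimal sketch on a made-up three-user frame, not the real dataset; the names `raw` and `coded` are hypothetical, and it assumes blank cells are represented as NaN.

```python
import numpy as np
import pandas as pd

# Toy ratings (made up for illustration): 1 = good, 0 = bad, NaN = blank.
raw = pd.DataFrame(
    {
        "M1": [1, 0, 0],
        "M2": [0, 1, np.nan],
        "M3": [np.nan, 1, 1],
    },
    index=pd.Index(["user1", "user2", "user3"], name="UserID"),
)

# Map 1 -> 1 and 0 -> -1 via 2*x - 1; NaN stays NaN and is then set to 0.
coded = (2 * raw - 1).fillna(0).astype(int)
print(coded)
```

With this encoding, a missing review contributes nothing to the dot product between two users, while agreeing reviews add to it and disagreeing reviews subtract from it.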
I computed the cosine similarity with sklearn's cosine_similarity and then clustered the users with its KMeans implementation. The code is:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Load the ratings matrix, one row per user.
df = pd.read_csv('input_file.csv', sep=',', encoding='latin-1',
                 index_col=False)
df = df.set_index('UserID')

# User-by-user cosine similarity matrix.
pairwise = pd.DataFrame(
    cosine_similarity(df.values),
    columns=df.index.values,
    index=df.index
)
print(pairwise.round(2))

# Flatten the matrix into (User A, User B, similarity) rows.
pairs = pairwise.unstack()
pairs.index.rename(['User A', 'User B'], inplace=True)
pairs = pairs.to_frame('cosine similarity').reset_index()

# Keep pairs of distinct users whose similarity is negative.
A = pairs[
    (pairs['cosine similarity'] < 0.00)
    & (pairs['User A'] != pairs['User B'])
]
print(A)

# Cluster the rows of the similarity matrix (KMeans uses Euclidean distance).
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=50,
                n_init=5, random_state=0)
y_kmeans = kmeans.fit_predict(pairwise)
print(y_kmeans)

# Attach the cluster labels to a copy of the similarity matrix.
frame = pairwise.copy()
frame['cluster'] = y_kmeans
print(frame['cluster'])
print(frame['cluster'].value_counts())
With this code, I get the cosine similarity for all pairs of users, and I can filter the pairs by their cosine similarity value. I also get a cluster label for each user. Am I doing this right? Is it correct to calculate the cosine similarity first and then pass those values to k-means, given that sklearn's KMeans uses Euclidean distance?
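As background for this question, here is a small sketch of how the two measures relate. For rows scaled to unit length, the squared Euclidean distance equals 2 - 2 * (cosine similarity), so Euclidean k-means on L2-normalized rows groups users by direction in much the same way cosine similarity does. The toy matrix `X` and the variable names are made up for illustration, and running KMeans on the normalized rows is shown only as one possible alternative for comparison, not as the confirmed right approach for this data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import normalize

# Toy recoded ratings (made up): rows are users, columns are movies.
X = np.array([
    [1, -1, 0, -1],
    [-1, 1, 1, 0],
    [-1, 1, 0, 1],
    [1, -1, 1, 0],
], dtype=float)

# Scale every row to unit L2 norm.
X_unit = normalize(X, norm="l2")

# On unit-length rows: squared Euclidean distance = 2 - 2 * cosine similarity.
sq_dist = euclidean_distances(X_unit, squared=True)
cos_sim = cosine_similarity(X)
print(np.allclose(sq_dist, 2 - 2 * cos_sim))

# Euclidean k-means on the normalized rows (assumed alternative, for comparison).
labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X_unit)
print(labels)
```

By contrast, passing the similarity matrix itself to KMeans treats each user's row of similarities to all other users as that user's feature vector, and Euclidean distance is then computed between those rows.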
I would really appreciate some help.
Thanks.