I have an input data set (in CSV format) consisting of 100,246 rows and 7 columns. It is movie-rating data taken from http://grouplens.org/datasets/movielens/. The head of my dataframe is:
In [5]: df.head()
Out[5]:
   movieId                                       genres  userId  rating  \
0        1  Adventure|Animation|Children|Comedy|Fantasy       1       5
1        1  Adventure|Animation|Children|Comedy|Fantasy       2       3
2        1  Adventure|Animation|Children|Comedy|Fantasy       5       4
3        1  Adventure|Animation|Children|Comedy|Fantasy       6       4
4        1  Adventure|Animation|Children|Comedy|Fantasy       8       3

   imdbId      title  relDate
0  114709  Toy Story     1995
1  114709  Toy Story     1995
2  114709  Toy Story     1995
3  114709  Toy Story     1995
4  114709  Toy Story     1995
Using this data set, I am calculating a similarity score for each pair of movies based on the euclidean distance between their user ratings (i.e. if two movies receive similar ratings from the same users, the distance is small and the movies are considered highly similar). At the moment, this is done by iterating over all movie pairs and using an if-statement to pick out only those pairs that contain the current movie of interest:
from itertools import combinations

movie_ids = df['movieId'].unique()
for i, item in enumerate(movie_ids):
    for j, item_comb in enumerate(combinations(movie_ids, 2)):
        if item in item_comb:
            pass  # calculate the similarity score between item and the other movie in item_comb
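For context, the per-pair step behind that placeholder comment looks roughly like the sketch below, using the df shown above. The pivot layout, the dropna handling of users who did not rate both movies, and the 1/(1 + distance) scaling are just one way to do it, not fixed parts of my pipeline:

import numpy as np

# user x movie matrix of ratings (NaN where a user did not rate a movie)
ratings = df.pivot_table(index='userId', columns='movieId', values='rating')

def euclidean_similarity(movie_a, movie_b):
    # keep only users who rated both movies, then take the euclidean distance
    pair = ratings[[movie_a, movie_b]].dropna()
    dist = np.linalg.norm(pair[movie_a] - pair[movie_b])
    # smaller distance -> higher similarity
    return 1.0 / (1.0 + dist)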
However, given that there are 8927 different movies in the data set, the number of pairs is ~40M, and the nested loops above re-check every one of those pairs for each movie (a quick count is shown below). This is a major bottleneck. So my question is: what are some ways I can speed up this code?
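For scale, the rough arithmetic behind those numbers (using the 8927 unique movies mentioned above):

from math import comb

n_movies = 8927                      # number of unique movieId values, as stated above
print(comb(n_movies, 2))             # 39,841,201 pairs -> the ~40M figure
print(n_movies * comb(n_movies, 2))  # ~3.6e11 if-checks performed by the nested loops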