3
votes

I have an input data set (in CSV format) consisting of 100246 rows and 7 columns. It is movie-rating data taken from http://grouplens.org/datasets/movielens/. The head of my dataframe is:

In [5]: df.head()
Out[5]: 
   movieId                                       genres  userId  rating  \
0        1  Adventure|Animation|Children|Comedy|Fantasy       1       5
1        1  Adventure|Animation|Children|Comedy|Fantasy       2       3
2        1  Adventure|Animation|Children|Comedy|Fantasy       5       4
3        1  Adventure|Animation|Children|Comedy|Fantasy       6       4
4        1  Adventure|Animation|Children|Comedy|Fantasy       8       3

   imdbId      title  relDate
0  114709  Toy Story     1995
1  114709  Toy Story     1995
2  114709  Toy Story     1995
3  114709  Toy Story     1995
4  114709  Toy Story     1995

Using this data set, I am calculating a similarity score between each pair of movies, based on the Euclidean distance between their user ratings (i.e. if two movies are rated similarly by the sample of users, then the movies are highly correlated). At the moment, this is done by iterating over all movie pairs and using an if-statement to select only those pairs that contain the current movie of interest:

from itertools import combinations

movie_ids = df['movieId'].unique()
for i, item in enumerate(movie_ids):
    for j, item_comb in enumerate(combinations(movie_ids, 2)):
        if item in item_comb:
            ## calculate the similarity score between item and the other item in item_comb

However, given that there are 8927 distinct movies in the data set, the number of pairs is ~40M, and the inner loop walks over all of them for every movie. This is a major bottleneck. So my question is: what are some ways I can speed up this code?

If you're trying to produce some kind of matrix, shouldn't you expand all genres into their own columns, populate the rows with 1/0 or True/False, and then just AND that filter with the user selection to produce a similarity calculation? - EdChum
The similarity score isn't based on the genres, but rather on the correlation (or Euclidean distance) between the user ratings for the two movies. So for two movies, I have two vectors (x, y) representing the ratings given by the users. I've edited the post to state this explicitly. - Feynman27
There are lots of Euclidean-distance questions like this, most with a numpy or scipy tag. - hpaulj
This sounds like a machine-learning problem, something like Collaborative filtering - maybe you can Google that and find a library that does it efficiently for you - alex314159
Yes, that's correct. I am using an item-based collaborative filter. I will see what I can dig up on Google. - Feynman27

4 Answers

0
votes

According to this link (collaborative-filtering scalability), MongoDB can be used to run a collaborative filter on extremely large data sets.

Spark (collaborative filtering with Apache Spark) may also be suitable.
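
For a sense of what that looks like in practice, here is a minimal sketch using PySpark's ALS recommender (my own illustration, not code from the linked post; the file name ratings.csv is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("movie-cf").getOrCreate()

    # Placeholder path; expects userId, movieId and rating columns
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # Fit a latent-factor collaborative-filtering model
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", rank=10)
    model = als.fit(ratings)

    # model.itemFactors holds one short latent vector per movie;
    # item-item similarities can be computed from these instead of
    # the full user-rating vectors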

0
votes

There are approaches that convert the iterative similarity computation into matrix multiplications. If you're using cosine similarity, the conversion is explained in more detail in the answer to this Stack Exchange question.
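
As a sketch of that idea (my own illustration; the matrix shape and random data are stand-ins for the real items x users rating matrix):

    import numpy as np

    n_items, n_users = 8927, 1000           # n_users is a stand-in value
    R = np.random.rand(n_items, n_users)    # stand-in for the real ratings

    # Normalize each item vector to unit length; guard against all-zero rows
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    R_unit = R / np.where(norms == 0, 1, norms)

    # A single matrix product now yields every pairwise cosine similarity
    similarity = R_unit @ R_unit.T          # shape (n_items, n_items)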

The other approach is to use the pairwise similarity metrics in the scikit-learn package, which includes an implementation of cosine similarity.

from sklearn.metrics.pairwise import cosine_similarity
user_ratings_df = ....            # create the user x item dataframe

# Note the dataframe is transposed so that items become the rows
item_similarity = cosine_similarity(user_ratings_df.T)
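
One caveat (an assumption about how the frame is built): if user_ratings_df comes from pivot_table, missing ratings will be NaN, which cosine_similarity rejects, so they have to be filled first. Treating a missing rating as 0 is the usual shortcut, though it does bias the result:

    # Replace missing ratings with 0 before computing similarities
    item_similarity = cosine_similarity(user_ratings_df.fillna(0).T)
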
0
votes

Vectorization is better than a for loop.

There are two pandas functions that may help here: pivot_table() and corr().

For example:

In [5]: pt = df.pivot_table(columns=['movieId'], index=['userId'], values='rating')

In [6]: pt
Out[6]: 
movieId  1    2    3    4    5
userId                       
1        5.0  ...
2        3.0  ...
5        4.0  ...
6        4.0  ...
8        3.0  ...

In [7]: pt.corr()
Out[7]: 
movieId  1      2      3      4      5
movieId                              
1        1.0    ...
2        0.XXX  ...
3        0.XXX  ...
4        0.XXX  ...
5        0.XXX  ...

Note that corr() here computes the standard correlation coefficient (Pearson correlation) between movies rather than the Euclidean distance. You can also use the min_periods parameter to set the minimum number of observations required per pair of columns to have a valid result.
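
If you want the Euclidean distance the question asks for rather than a correlation, scikit-learn has a pairwise metric that tolerates the NaNs produced by pivot_table. A minimal sketch, assuming scikit-learn >= 0.22 (where nan_euclidean_distances was added):

    import pandas as pd
    from sklearn.metrics.pairwise import nan_euclidean_distances

    pt = df.pivot_table(columns=['movieId'], index=['userId'], values='rating')

    # Rows must be movies, so transpose; coordinates where either movie
    # lacks a rating are ignored and the distance is rescaled accordingly
    dist = nan_euclidean_distances(pt.T)
    dist_df = pd.DataFrame(dist, index=pt.columns, columns=pt.columns)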

-1
votes

This paper discusses another approach that can make your algorithm faster.

If I remember correctly, an Amazon paper from 2003 ("Amazon.com Recommendations: Item-to-Item Collaborative Filtering") already describes it.

In summary, the most important idea behind this algorithm is to compute the dot product of two vectors in a different way, rather than simply iterating through each element. This way, the algorithm only does the computation for two items when they have customers in common. In other words, it skips the zero-similarity computations.

For each item in product catalog, I1
  For each customer C who purchased I1
    For each item I2 purchased by customer C
      Record that a customer purchased I1 and I2
  For each item I2
    Compute the similarity between I1 and I2
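
A minimal sketch of the same trick applied to the question's dataframe df (my own illustration, not code from the paper): only movie pairs that share at least one user are ever generated.

    from collections import defaultdict
    from itertools import combinations

    # Map each user to the set of movies they rated
    movies_by_user = df.groupby('userId')['movieId'].apply(set)

    # Count, per movie pair, how many users rated both
    co_raters = defaultdict(int)
    for movies in movies_by_user:
        for pair in combinations(sorted(movies), 2):
            co_raters[pair] += 1

    # co_raters now contains only the pairs with overlap; compute the
    # similarity (Euclidean distance, cosine, ...) for just these pairs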