3
votes

I have an input data set (in CSV format) consisting of 100246 rows and 7 columns. It is movie-rating data taken from http://grouplens.org/datasets/movielens/. The head of my dataframe is:

In [5]: df.head()
Out[5]: 
   movieId                                       genres  userId  rating  \
0        1  Adventure|Animation|Children|Comedy|Fantasy       1       5
1        1  Adventure|Animation|Children|Comedy|Fantasy       2       3
2        1  Adventure|Animation|Children|Comedy|Fantasy       5       4
3        1  Adventure|Animation|Children|Comedy|Fantasy       6       4
4        1  Adventure|Animation|Children|Comedy|Fantasy       8       3

   imdbId      title  relDate
0  114709  Toy Story     1995
1  114709  Toy Story     1995
2  114709  Toy Story     1995
3  114709  Toy Story     1995
4  114709  Toy Story     1995

Using this data set, I am calculating a similarity score between each pair of movies, based on the Euclidean distance between their user ratings (i.e. if two movies are rated similarly by the sample of users, then the movies are highly correlated). At the moment, this is done by iterating over all movie pairs and using an if-statement to select only those pairs that contain the current movie of interest:

from itertools import combinations

movie_ids = df['movieId'].unique()
for i, item in enumerate(movie_ids):
    for j, item_comb in enumerate(combinations(movie_ids, 2)):
        if item in item_comb:
            ## calculate the similarity score between item and the other item in item_comb

However, given that there are 8927 distinct movies in the data set, the number of pairs is ~40M, and the inner loop walks over all of them for every movie. This is a major bottleneck. So my question is: what are some ways I can speed up this code?

If you're trying to produce some kind of matrix, shouldn't you expand all genres into their own columns, populate the rows with 1/0 or True/False, and then just AND that filter with the user selection to produce a similarity calculation? - EdChum
The similarity score isn't based on the genres, but rather on the correlation (or Euclidean distance) between the user ratings for the two movies. So for two movies, I have two vectors (x, y) representing the ratings given by the users. I've edited the post to state this explicitly. - Feynman27
There are lots of Euclidean-distance questions like this, most with a numpy or scipy tag. - hpaulj
This sounds like a machine-learning problem, something like Collaborative filtering - maybe you can Google that and find a library that does it efficiently for you - alex314159
Yes, that's correct. I am using an item-based collaborative filter. I will see what I can dig up on Google. - Feynman27

4 Answers

0
votes

According to this link (collaborative-filtering scalability), MongoDB can be used to run a collaborative filter on extremely large data sets.

Spark (collaborative filtering with Apache Spark) may also be suitable.
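
For a sense of what that looks like in practice, here is a minimal sketch using PySpark's ALS recommender (my own illustration, not code from the linked post; the file name ratings.csv is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("movie-cf").getOrCreate()

    # Placeholder path; expects userId, movieId and rating columns
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # Fit a latent-factor collaborative-filtering model
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", rank=10)
    model = als.fit(ratings)

    # model.itemFactors holds one short latent vector per movie;
    # item-item similarities can be computed from these instead of
    # the full user-rating vectors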

0
votes

There are approaches that convert the iterative similarity computation into matrix multiplications. If you're using cosine similarity, the conversion is explained in more detail in the answer to this Stack Exchange question.
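
As a sketch of that idea (my own illustration; the matrix shape and random data are stand-ins for the real items x users rating matrix):

    import numpy as np

    n_items, n_users = 8927, 1000           # n_users is a stand-in value
    R = np.random.rand(n_items, n_users)    # stand-in for the real ratings

    # Normalize each item vector to unit length; guard against all-zero rows
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    R_unit = R / np.where(norms == 0, 1, norms)

    # A single matrix product now yields every pairwise cosine similarity
    similarity = R_unit @ R_unit.T          # shape (n_items, n_items)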

The other approach is to use the pairwise similarity metrics in the scikit-learn package, which includes an implementation of cosine similarity.

from sklearn.metrics.pairwise import cosine_similarity
user_ratings_df = ....            # create the user x item dataframe

# Note the dataframe is transposed so that items become the rows
item_similarity = cosine_similarity(user_ratings_df.T)
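
One caveat (an assumption about how the frame is built): if user_ratings_df comes from pivot_table, missing ratings will be NaN, which cosine_similarity rejects, so they have to be filled first. Treating a missing rating as 0 is the usual shortcut, though it does bias the result:

    # Replace missing ratings with 0 before computing similarities
    item_similarity = cosine_similarity(user_ratings_df.fillna(0).T)
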
0
votes

Vectorization is better than a for loop.

There are two pandas functions that may help here: pivot_table() and corr().

For example:

In [5]: pt = df.pivot_table(columns=['movieId'], index=['userId'], values='rating')

In [6]: pt
Out[6]: 
movieId  1    2    3    4    5
userId                       
1        5.0  ...
2        3.0  ...
5        4.0  ...
6        4.0  ...
8        3.0  ...

In [7]: pt.corr()
Out[7]: 
movieId  1      2      3      4      5
movieId                              
1        1.0    ...
2        0.XXX  ...
3        0.XXX  ...
4        0.XXX  ...
5        0.XXX  ...

Note that corr() here computes the standard correlation coefficient (Pearson correlation) between movies rather than the Euclidean distance. You can also use the min_periods parameter to set the minimum number of observations required per pair of columns to have a valid result.
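
If you want the Euclidean distance the question asks for rather than a correlation, scikit-learn has a pairwise metric that tolerates the NaNs produced by pivot_table. A minimal sketch, assuming scikit-learn >= 0.22 (where nan_euclidean_distances was added):

    import pandas as pd
    from sklearn.metrics.pairwise import nan_euclidean_distances

    pt = df.pivot_table(columns=['movieId'], index=['userId'], values='rating')

    # Rows must be movies, so transpose; coordinates where either movie
    # lacks a rating are ignored and the distance is rescaled accordingly
    dist = nan_euclidean_distances(pt.T)
    dist_df = pd.DataFrame(dist, index=pt.columns, columns=pt.columns)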

-1
votes

This paper discusses another approach that can make your algorithm faster.

If I remember correctly, an Amazon paper from 2003 ("Amazon.com Recommendations: Item-to-Item Collaborative Filtering") already describes it.

In summary, the most important idea behind this algorithm is to compute the dot product of two vectors in a different way, rather than simply iterating through each element. This way, the algorithm only does the computation for two items when they have customers in common. In other words, it skips the zero-similarity computations.

For each item in product catalog, I1
  For each customer C who purchased I1
    For each item I2 purchased by customer C
      Record that a customer purchased I1 and I2
  For each item I2
    Compute the similarity between I1 and I2
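
A minimal sketch of the same trick applied to the question's dataframe df (my own illustration, not code from the paper): only movie pairs that share at least one user are ever generated.

    from collections import defaultdict
    from itertools import combinations

    # Map each user to the set of movies they rated
    movies_by_user = df.groupby('userId')['movieId'].apply(set)

    # Count, per movie pair, how many users rated both
    co_raters = defaultdict(int)
    for movies in movies_by_user:
        for pair in combinations(sorted(movies), 2):
            co_raters[pair] += 1

    # co_raters now contains only the pairs with overlap; compute the
    # similarity (Euclidean distance, cosine, ...) for just these pairs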