
I am training an ALS model for recommendations. I have about 200M ratings from about 10M users and 3M products, and a small cluster with 48 cores and 120 GB of cluster-wide memory.

My code is very similar to the example in spark/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala.

I have a couple of questions:

All steps up to model training run reasonably fast. Model training takes under 10 minutes for rank 20. However, the model.recommendProductsForUsers step is either slow or just does not work, as the code seems to hang at this point. I have tried user and product block sizes of -1, 20, 40, etc., and played with executor memory size. Can someone shed some light on what could be wrong?

Also, is there any example code for the ml.recommendation.ALS algorithm? I can figure out how to train the model, but I don't understand from the documentation how to perform predictions.

Thanks for any information you can provide.


1 Answer


The ALS algorithm essentially outputs two things:

  1. model.productFeatures: Int -> Array[Double] where Int is the product ID, and Array[Double] is the vector representing this product.
  2. model.userFeatures: Int -> Array[Double] where Int is the user ID, and Array[Double] is the vector representing this user.

To make a prediction, we take the dot product of two vectors. To compute a similarity, we take the cosine of the angle between two vectors. So, to:

  1. Predict product P for user U, we compute U dot P;
  2. Compute similarity between U1 and U2, we compute (U1 dot U2) / (||U1||_2 x ||U2||_2);
  3. Compute similarity between P1 and P2, we compute (P1 dot P2) / (||P1||_2 x ||P2||_2)
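In plain Scala, with toy vectors standing in for the factor arrays the model would give you, those formulas look like this:

```scala
// Toy factor vectors standing in for entries of model.userFeatures / model.productFeatures.
val u: Array[Double] = Array(0.1, 0.3, -0.2)
val p: Array[Double] = Array(0.4, 0.0, 0.5)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

// Predicted rating of product p for user u: the dot product U dot P.
val prediction = dot(u, p)

// Cosine similarity between two vectors (user-user or product-product alike):
// (A dot B) / (||A||_2 x ||B||_2)
def cosine(a: Array[Double], b: Array[Double]): Double =
  dot(a, b) / (norm(a) * norm(b))
```

The same two helpers cover all three cases above; only the pair of vectors you feed in changes.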

The reason model.recommendProductsForUsers is so slow is that it computes that dot product for all users, for all products. Given rank r in your model, this means U x P x 2r operations. In your case that is 10M x 3M x (2 x 20) = 1.2 x 10^15 calculations - a lot!
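As a quick sanity check on that number:

```scala
val users = 10e6      // 10 million users
val products = 3e6    // 3 million products
val rank = 20
// One dot product of two rank-20 vectors costs roughly 2 * rank operations
// (one multiply and one add per component).
val totalOps = users * products * 2 * rank  // 1.2e15
```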

A much better approach is to skip this brute-force helper function, introduce heuristics that cut down the number of products that could be predicted for each user, and compute the predictions yourself. For example, if you have a product hierarchy, you can restrict the candidate products to the categories the user has previously browsed, or to those within one branch of them. Every recommender system faces this problem, and there is no one-size-fits-all solution; to make things fast, you need to do the computation yourself with some filtering heuristics.
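As a sketch of that "compute it yourself" approach, in pure Scala on toy data: here `candidatesFor` is a hypothetical stand-in for whatever heuristic (category filters, browsing history) narrows the product set per user, and the maps stand in for collected factor RDDs.

```scala
// Toy factor maps standing in for collected model.userFeatures / model.productFeatures.
val userFeatures: Map[Int, Array[Double]] = Map(
  1 -> Array(0.1, 0.9),
  2 -> Array(0.8, 0.2)
)
val productFeatures: Map[Int, Array[Double]] = Map(
  10 -> Array(0.0, 1.0),
  20 -> Array(1.0, 0.0),
  30 -> Array(0.5, 0.5)
)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Hypothetical filtering heuristic: return only products the user could
// plausibly want. Here it returns everything; in practice it would consult
// a category hierarchy or the user's browsing history.
def candidatesFor(user: Int): Set[Int] = productFeatures.keySet

// Top-k recommendations for one user, scoring only the filtered candidates.
def recommend(user: Int, k: Int): Seq[(Int, Double)] = {
  val u = userFeatures(user)
  candidatesFor(user).toSeq
    .map(pid => pid -> dot(u, productFeatures(pid)))
    .sortBy(-_._2)
    .take(k)
}
```

In a real job you would do the same scoring distributed (e.g. join each user with its candidate products and map over the pairs), but the cost now scales with the candidate count per user rather than with all 3M products.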