
I have been working with Mahout to create a recommendation engine based on the following data:

  • 100k users
  • 10k items
  • 4M ratings

I'm running it on Tomcat with the following JVM arguments:

-Xms1024M -Xmx1024M -da -dsa -XX:NewRatio=9 -server

Each recommendation takes about 6 seconds, which seems slow. How can I improve Mahout's performance?

I'm using the following code.

This part runs once at startup:

// Load all preferences from MySQL into memory once.
JDBCDataModel jdbcdatamodel = new MySQLJDBCDataModel(dataSource);
dataModel = new ReloadFromJDBCDataModel(jdbcdatamodel);

// Item-based recommender with cached similarities and sampled candidate items.
ItemSimilarity similarity = new CachingItemSimilarity(new EuclideanDistanceSimilarity(dataModel), dataModel);
SamplingCandidateItemsStrategy strategy = new SamplingCandidateItemsStrategy(10, 5);
recommender = new CachingRecommender(new GenericItemBasedRecommender(dataModel, similarity, strategy, strategy));

Then, for every user request, I call:

recommender.recommend(userId, howMany);
The bottleneck is the database access. - Julian Ortega
@JulianOrtega ReloadFromJDBCDataModel loads the data model from the database into memory, so this cost is paid only once. Or am I missing something? - Thibaud
Well, since you didn't actually share the code that generates the recommendations, I had to take a guess. - Julian Ortega
@JulianOrtega Sorry for the imprecision, I edited my post to be more precise. - Thibaud

1 Answer


I would suggest a different approach: use a nightly job to pre-calculate recommendations for ALL users and load the results into a MySQL table. Showing recommendations then becomes nothing more than a simple DB lookup.

Since you have 10K items, calculating recommendations for a single user means Mahout has to internally multiply a (10K x 10K) matrix by a (10K x 1) matrix, so 6 seconds actually seems quite fast considering the size. Reference
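
To make the cost concrete, here is a minimal sketch of that multiplication on a toy 3-item example. It assumes the large matrix holds item-item similarities and the vector holds one user's ratings; the class and method names are hypothetical, and this is not Mahout's actual implementation, which works on sparse data and samples candidates.

```java
import java.util.Arrays;

// Illustrative only: dense similarity matrix times one user's rating vector.
// Names are hypothetical; this is not how Mahout is implemented internally.
final class ItemScoringSketch {

    // scores[i] = sum over j of similarity[i][j] * ratings[j].
    // For n items this is n * n multiply-adds per user.
    static double[] score(double[][] similarity, double[] ratings) {
        int n = ratings.length;
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                sum += similarity[i][j] * ratings[j];
            }
            scores[i] = sum;
        }
        return scores;
    }

    public static void main(String[] args) {
        double[][] similarity = {
            {1.0, 0.5, 0.0},
            {0.5, 1.0, 0.2},
            {0.0, 0.2, 1.0}
        };
        double[] ratings = {4.0, 0.0, 2.0}; // 0.0 stands for "not rated"
        System.out.println(Arrays.toString(score(similarity, ratings)));
    }
}
```

With n = 10K the double loop performs on the order of 10^8 multiply-adds per user in the dense case, which is why a per-request computation is expensive.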

Now, if you use the RecommenderJob on Hadoop with AWS EMR, it will take roughly under 10 minutes to process data at your scale. Alternatively, you can do the same job in a non-distributed way by simply looping over all users and pre-calculating their recommendations sequentially. The downside is that your recommendations always lag behind by 1 day, 6 hours, or whatever frequency you choose for the job.
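
The non-distributed loop could look roughly like the sketch below. To keep it self-contained, the `recommend` function parameter is a stand-in for a call such as recommender.recommend(userId, howMany), the toy recommender is made up, and the returned map is a placeholder for the MySQL table; all of these names are assumptions, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative batch job: compute recommendations for every user up front.
final class PrecomputeSketch {

    // `recommend` is a hypothetical stand-in for the real recommender call;
    // it maps (userId, howMany) to a list of recommended item IDs.
    static Map<Long, List<Long>> precomputeAll(
            Iterable<Long> userIds,
            int howMany,
            BiFunction<Long, Integer, List<Long>> recommend) {
        Map<Long, List<Long>> results = new LinkedHashMap<>();
        for (Long userId : userIds) {
            // In a real job, this is where a row would be written to the database.
            results.put(userId, new ArrayList<>(recommend.apply(userId, howMany)));
        }
        return results;
    }

    // Toy recommender used only for this demo: returns made-up item IDs.
    static List<Long> toyRecommend(Long userId, Integer howMany) {
        List<Long> items = new ArrayList<>();
        for (int k = 1; k <= howMany; k++) {
            items.add(userId * 10 + k);
        }
        return items;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> table =
            precomputeAll(Arrays.asList(1L, 2L, 3L), 2, PrecomputeSketch::toyRecommend);
        // Serving a request is now a plain lookup instead of a computation.
        System.out.println(table.get(2L));
    }
}
```

At request time the application only reads the stored list for the user, so latency no longer depends on the recommender at all.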