
I have been working with Mahout in the past few days trying to create a recommendation engine. The project I'm working on has the following data:

  • 12M users
  • 2M items
  • 18M user-item boolean recommendations

I am now experimenting with one third of the full set (6M of the 18M recommendations). With every configuration I tried, Mahout gave quite disappointing results: some recommendations took 1.5 seconds, while others took over a minute. I think a reasonable time for a recommendation would be around 100 ms.

    Why is Mahout so slow?
    I'm running the application on Tomcat with the following JVM arguments (although adding them didn't make much of a difference):

    -Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=9 -XX:+UseParallelGC -XX:+UseParallelOldGC
    

    Below are code snippets for my experiments:

    User similarity 1:

    DataModel model = new FileDataModel(new File(dataFile));
    UserSimilarity similarity = new CachingUserSimilarity(new LogLikelihoodSimilarity(model), model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, Double.NEGATIVE_INFINITY, similarity, model, 0.5);
    recommender = new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    

    User similarity 2:

    DataModel model = new FileDataModel(new File(dataFile));
    UserSimilarity similarity = new CachingUserSimilarity(new LogLikelihoodSimilarity(model), model);
    UserNeighborhood neighborhood = new CachingUserNeighborhood(new NearestNUserNeighborhood(10, similarity, model), model);
    recommender = new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    

    Item similarity 1:

    DataModel dataModel = new FileDataModel(new File(dataFile));
    ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
    recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);
    
    2 Answers

    4
    votes

    With the gracious help of the Mahout community via its mailing list, we have found a solution to my problem. All of the code related to the solution was committed into Mahout 0.6. More details can be found in the corresponding JIRA ticket.

    Using VisualVM, I found that the performance bottleneck was in the computation of item-item similarities. This was addressed by @Sean with a very simple but effective fix (see the SVN commit for more details).

    Additionally, we have discussed how to improve the SamplingCandidateItemsStrategy to allow finer control over the sampling rate.

    Finally, I did some testing with my application with the aforementioned fixes. All the recommendations took less than 1.5 seconds with the overwhelming majority taking less than 500ms. Mahout could easily handle 100 recommendations per second (I did not try to stress it more than that).

    2
    votes

    Small suggestion: your last snippet should use GenericBooleanPrefItemBasedRecommender.

    For your data set, the item-based algorithm should be best.

    That does sound slow, and a minute is far too long. The culprit is lumpy data: response time can scale with the number of ratings a user has provided, so requests for users with many ratings take much longer than the rest.

    Look at SamplingCandidateItemsStrategy. It lets you limit the amount of work done per request by sampling when the data is particularly dense. You can plug it into GenericBooleanPrefItemBasedRecommender instead of the default strategy. I think this will give you a lever to increase speed and also make response time more predictable.
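
To make the sampling idea concrete, here is a minimal, library-free sketch of capping the work done for a heavy user. It is not Mahout's implementation: the class `SamplingSketch`, the method `sampleItems`, and the cap of 100 are hypothetical names and values chosen for illustration. It only shows why a cap bounds per-request cost regardless of how many preferences a user has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: bounds per-request work by sampling a user's items.
final class SamplingSketch {

    /**
     * Returns at most maxItems of the user's item IDs, chosen uniformly at random.
     * Users at or below the cap get all of their items back, in original order.
     */
    public static List<Long> sampleItems(List<Long> userItems, int maxItems, Random random) {
        if (maxItems < 0) {
            throw new IllegalArgumentException("maxItems must be non-negative");
        }
        if (userItems.size() <= maxItems) {
            return new ArrayList<>(userItems);
        }
        // Shuffle a copy so the caller's list is left untouched, then keep a prefix.
        List<Long> copy = new ArrayList<>(userItems);
        Collections.shuffle(copy, random);
        return new ArrayList<>(copy.subList(0, maxItems));
    }

    public static void main(String[] args) {
        // A "lumpy" data set: one user with 50,000 items, another with 3.
        List<Long> heavyUser = new ArrayList<>();
        for (long id = 0; id < 50_000; id++) {
            heavyUser.add(id);
        }
        List<Long> lightUser = Arrays.asList(1L, 2L, 3L);

        Random random = new Random(42);
        // Downstream similarity work now sees at most 100 items per user.
        System.out.println(sampleItems(heavyUser, 100, random).size()); // prints 100
        System.out.println(sampleItems(lightUser, 100, random).size()); // prints 3
    }
}
```

The trade-off is the usual one for sampling: results for very active users become approximate and can vary between requests, in exchange for a bounded and more predictable response time.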