3 votes

I have the following setup:

boolean data: (userid, itemid)

Hadoop-based Mahout ItemSimilarityJob with the following arguments: --similarityClassname SIMILARITY_LOGLIKELIHOOD --maxSimilaritiesPerItem 50 and others (input, output, ...)

item-based boolean recommender: -model MySQLBooleanPrefJDBCDataModel -similarity MySQLJDBCInMemoryItemSimilarity -candidateStrategy AllSimilarItemsCandidateItemsStrategy -mostSimilarItemsCandidateStrategy AllSimilarItemsCandidateItemsStrategy
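For reference, the run-time wiring looks roughly like this (a sketch; the MySQL connection details are placeholders and I rely on Mahout's default table names):

```java
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLBooleanPrefJDBCDataModel;
import org.apache.mahout.cf.taste.impl.recommender.AllSimilarItemsCandidateItemsStrategy;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class RecommenderSetup {

  public static ItemBasedRecommender build() throws TasteException {
    // connection details are placeholders; in production this is a pooled DataSource
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("localhost");
    dataSource.setDatabaseName("recommender");
    dataSource.setUser("mahout");
    dataSource.setPassword("secret");

    // boolean (userid, itemid) preferences, read from the default taste_preferences table
    DataModel model = new MySQLBooleanPrefJDBCDataModel(dataSource);

    // similarities precomputed by the Hadoop ItemSimilarityJob, loaded into memory
    ItemSimilarity similarity = new MySQLJDBCInMemoryItemSimilarity(dataSource);

    // every similar item of every preferred item becomes a candidate
    AllSimilarItemsCandidateItemsStrategy candidates =
        new AllSimilarItemsCandidateItemsStrategy(similarity);

    return new GenericBooleanPrefItemBasedRecommender(model, similarity, candidates, candidates);
  }
}
```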

  1. Is there a way to use co-occurrence similarity in my setup to get final recommendations? If I plug SIMILARITY_COOCCURRENCE into the job, the MySQLJDBCInMemoryItemSimilarity precondition checks fail since the counts become greater than 1. I know I can get final recommendations by running the recommender job on the precomputed similarities. Is there a way to do this in real time using the API, as in the case of log-likelihood similarity (and other similarity metrics whose values lie between -1 and 1), using MySQLJDBCInMemoryItemSimilarity?

  2. How can we cap the maximum number of similar items per item in the item similarity job? What I mean is that AllSimilarItemsCandidateItemsStrategy calls allSimilarItemIDs(item) to get all possible candidates. Is there a way I can get, say, only the top 10/20/50 similar items using the API? I know we can pass --maxSimilaritiesPerItem to the item similarity job, but I am not completely sure what it stands for and how it works. If I set this to 10/20/50, will I be able to achieve what is stated above? Also, is there a way to accomplish this via the API?

  3. I am using a rescorer for filtering out and rescoring final recommendations. With the rescorer, the calls to /recommend/userid?howMany=10&rescore={..} and to /similar/itemid?howMany=10&rescore={..} take far longer (300-400ms) than without it (30-70ms). I'm using Redis as an in-memory store to fetch the rescore data, and the rescorer also receives some run-time data, as shown above (the request handling is sketched just after this list). There are only a few checks that happen in the rescorer. The problem is that as the number of item preferences for a particular user grows (> 100), the number of calls to isFiltered() and rescore() increases massively. This is mainly because, for every user preference, the call to candidateStrategy.getCandidateItems(item) returns around 100+ similar items, and the rescorer is called for each of these items. Hence the need to cap the maximum number of similar items per item in the job. Is this correct, or am I missing something here? What's the best way to optimise the rescorer in this case?
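For reference, this is roughly how a /recommend request is served at the moment (a sketch; the Redis key names and the boost logic are simplified stand-ins for what I actually do):

```java
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.IDRescorer;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

import redis.clients.jedis.Jedis;

public class RecommendEndpoint {

  private final ItemBasedRecommender recommender;
  private final Jedis redis = new Jedis("localhost");

  public RecommendEndpoint(ItemBasedRecommender recommender) {
    this.recommender = recommender;
  }

  // handles /recommend/{userid}?howMany=10&rescore={..}
  public List<RecommendedItem> recommend(long userId, int howMany, final String rescoreParam)
      throws TasteException {
    IDRescorer rescorer = new IDRescorer() {
      @Override
      public boolean isFiltered(long itemId) {
        // one Redis round trip per candidate item; with 100+ preferences and 100+
        // similar items per preference this runs thousands of times per request
        return redis.sismember("filtered:" + rescoreParam, String.valueOf(itemId));
      }
      @Override
      public double rescore(long itemId, double originalScore) {
        String boost = redis.get("boost:" + itemId);
        return boost == null ? originalScore : originalScore * Double.parseDouble(boost);
      }
    };
    return recommender.recommend(userId, howMany, rescorer);
  }
}
```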

MySQLJDBCInMemoryItemSimilarity uses GenericItemSimilarity to load item similarities into memory, and its allSimilarItemIDs(item) returns all possible similar items for a given item from the precomputed item similarities in MySQL. Do I need to implement my own item similarity class to return only the top 10/20/50 similar items? And what happens as a user's number of preferences continues to grow?

It would be really great if anyone could tell me how to achieve the above. Thanks heaps!


1 Answer

2 votes

What Preconditions checks are you referring to? I don't see them; I'm not sure that similarity is actually prohibited from being > 1. But you seem to be asking whether you can make a similarity function that just returns co-occurrence, as an ItemSimilarity that is not used with Hadoop. Yes you can; it does not exist in the project. I would not advise this; LogLikelihoodSimilarity is going to be much smarter.
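If you wanted to try it anyway, a co-occurrence ItemSimilarity would look roughly like this (a sketch, not something that exists in Mahout; it just counts users-in-common through the DataModel):

```java
import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/** Similarity = raw count of users who expressed a preference for both items. */
public final class CooccurrenceItemSimilarity implements ItemSimilarity {

  private final DataModel dataModel;

  public CooccurrenceItemSimilarity(DataModel dataModel) {
    this.dataModel = dataModel;
  }

  @Override
  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
    return dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
  }

  @Override
  public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
    double[] result = new double[itemID2s.length];
    for (int i = 0; i < itemID2s.length; i++) {
      result[i] = itemSimilarity(itemID1, itemID2s[i]);
    }
    return result;
  }

  @Override
  public long[] allSimilarItemIDs(long itemID) throws TasteException {
    // brute force: every item that co-occurs at least once; fine as a sketch, slow in practice
    FastIDSet similar = new FastIDSet();
    LongPrimitiveIterator itemIDs = dataModel.getItemIDs();
    while (itemIDs.hasNext()) {
      long otherID = itemIDs.nextLong();
      if (otherID != itemID && itemSimilarity(itemID, otherID) > 0.0) {
        similar.add(otherID);
      }
    }
    return similar.toArray();
  }

  @Override
  public void refresh(Collection<Refreshable> alreadyRefreshed) {
    // nothing cached here; the underlying DataModel handles its own refresh
  }
}
```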

You need a different CandidateItemsStrategy; in particular, look at SamplingCandidateItemsStrategy and its javadoc. But this is not related to Hadoop; rather, it is a run-time element, whereas you mention a flag to the Hadoop job. That is not the same thing.
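Something like this, roughly (a sketch; check the javadoc for the exact constructor arguments, and there is a longer constructor that exposes the sampling factors directly):

```java
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.SamplingCandidateItemsStrategy;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class SampledRecommenderSetup {

  public static ItemBasedRecommender build(DataModel model, ItemSimilarity similarity)
      throws TasteException {
    // samples candidates per preferred item instead of taking every similar item
    // of every item the user has preferred
    SamplingCandidateItemsStrategy candidates =
        new SamplingCandidateItemsStrategy(model.getNumUsers(), model.getNumItems());
    return new GenericBooleanPrefItemBasedRecommender(model, similarity, candidates, candidates);
  }
}
```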

If rescoring is slow, it means, well, the IDRescorer is slow. It is called so many times that you certainly need to cache any lookup data in memory. But reducing the number of candidates, per the above, will also reduce the number of times it is called.
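For example (a sketch assuming Jedis and made-up key names, not your actual rescorer): pull the rescore data from Redis once, up front, so isFiltered() and rescore() become plain in-memory lookups:

```java
import java.util.Map;
import java.util.Set;

import org.apache.mahout.cf.taste.recommender.IDRescorer;

import redis.clients.jedis.Jedis;

public final class CachedRescorer implements IDRescorer {

  private final Set<String> filtered;        // loaded once per request
  private final Map<String, String> boosts;  // itemId -> multiplier, loaded once per request

  public CachedRescorer(Jedis redis, String rescoreParam) {
    // two bulk round trips instead of one round trip per candidate item
    this.filtered = redis.smembers("filtered:" + rescoreParam);
    this.boosts = redis.hgetAll("boost:" + rescoreParam);
  }

  @Override
  public boolean isFiltered(long itemId) {
    return filtered.contains(String.valueOf(itemId));  // in-memory check
  }

  @Override
  public double rescore(long itemId, double originalScore) {
    String boost = boosts.get(String.valueOf(itemId));
    return boost == null ? originalScore : originalScore * Double.parseDouble(boost);
  }
}
```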

No, don't implement your own similarity. Your issue is not the similarity measure but how many items are considered as candidates.

I am the author of much of the code you are talking about. I think you are wrestling with exactly the kinds of issues most people run into when trying to make item-based work at significant scale. You can, with enough sampling and tuning.

However, I am putting new development into a different project and company called Myrrix, which is developing a sort of 'next-gen' recommender based on the same APIs, but which ought to scale without these complications since it is based on matrix factorization. If you have time and interest, I strongly encourage you to have a look at Myrrix. Same APIs, the real-time Serving Layer is free/open, and the Hadoop-based Computation Layer is also available for testing.