Well, in the Information Retrieval context items are handled in a boolean manner, i.e., they are either relevant or non-relevant. Mahout's GenericRecommenderIRStatsEvaluator uses a data splitter to build a set of already preferred (or, in your case, bought) items, which represents the relevant items. In Mahout's case the selected items are the top-n most-preferred items, so with boolean ratings it simply selects n preferred items. I don't believe this makes the evaluation itself drastically less accurate than with regular five-star ratings, since buying is a pretty strong sign of preference. So:
1) If you have managed to make recommendations, then you can evaluate them using precision and recall as metrics.
2) I have used a random recommender as a benchmark (just an implementation of a Mahout recommender that selects n random items). It usually produces pretty low precision and recall, so if an algorithm scores lower than the random recommender, it should probably be ditched. Another metric I would look at in the offline evaluation phase is reach, since a recommender that produces recommendations for only 80 out of 6000 active users is pretty useless.
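To make the boolean precision/recall evaluation concrete, here is a minimal sketch in plain Java (no Mahout dependency; the class and method names are just illustrative, not Mahout's API). It shows the two steps described above: picking a user's top-n preferred items as the "relevant" set, the way the data splitter roughly does, and then scoring a recommendation list against that set:

```java
import java.util.*;
import java.util.stream.Collectors;

public class IRStatsSketch {
    // Pick the n most-preferred items as the "relevant" set, roughly what
    // the data splitter does. With boolean preferences every rating is
    // equal, so this amounts to selecting any n preferred items.
    static Set<String> relevantSet(Map<String, Double> prefs, int n) {
        return prefs.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    // precision@n: fraction of recommended items that are relevant.
    static double precision(List<String> recommended, Set<String> relevant) {
        if (recommended.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / recommended.size();
    }

    // recall@n: fraction of relevant items that were recommended.
    static double recall(List<String> recommended, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }
}
```

For example, if the relevant set is {a, c, d, e} and the recommender returns [a, b, c], precision is 2/3 and recall is 2/4.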
Also, it should be noted that in academic papers precision and recall have been criticized when used as the sole metrics. In the end the user decides what is relevant and what is not, and a recommender that scores slightly lower than another is not necessarily worse. For example, more novel or serendipitous recommendations may lower precision and recall.
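The random-recommender baseline and the reach metric from point 2 can also be sketched in a few lines of plain Java (again, illustrative names and a stand-in implementation, not Mahout's classes):

```java
import java.util.*;

public class RandomBaselineSketch {
    // A random recommender baseline: pick n items the user has not yet
    // seen, uniformly at random. Beating this is the bare minimum for
    // any real recommender.
    static List<String> recommend(Set<String> allItems, Set<String> seen,
                                  int n, Random rng) {
        List<String> candidates = new ArrayList<>(allItems);
        candidates.removeAll(seen);
        Collections.shuffle(candidates, rng);
        return candidates.subList(0, Math.min(n, candidates.size()));
    }

    // Reach: share of active users who receive at least one recommendation.
    static double reach(int usersWithRecommendations, int activeUsers) {
        return (double) usersWithRecommendations / activeUsers;
    }
}
```

With 80 users recommended out of 6000 active ones, reach is about 0.013, which is the kind of number that should disqualify a recommender regardless of its precision.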