
I'm working on a recommendation engine which uses an item-based collaborative filter to create recommendations for restaurants. Each restaurant has reviews with a rating from 1-5.
Like most recommendation algorithms, it struggles with data sparsity, so I have been looking for ways to compute a reliable similarity.

I'm using an adjusted cosine similarity between restaurants.

To compute a similarity between two restaurants, you need users who have rated both of them. But what is the minimum number of users who must have rated both restaurants to get a reliable similarity?

From testing, I have discovered that requiring only 1 user who has rated both restaurants results in bad similarities (obviously): the value is often -1 or 1. So I increased the minimum to 2 users who have rated both restaurants, which gave me better similarities. I just find it difficult to determine whether these similarities are good enough. Is there a method that tests the accuracy of the similarity, or are there guidelines on what the minimum should be?
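For reference, here is why a single co-rater produces -1 or 1. This assumes the standard definition of adjusted cosine similarity (the exact variant is not spelled out above), where $U_{ij}$ is the set of users who rated both restaurants $i$ and $j$, $r_{u,i}$ is user $u$'s rating of restaurant $i$, and $\bar{r}_u$ is that user's mean rating:

```latex
\mathrm{sim}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_u)^2}\,
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_u)^2}}

% With a single co-rater u, write a = r_{u,i} - \bar{r}_u and b = r_{u,j} - \bar{r}_u:
\mathrm{sim}(i,j) = \frac{a\,b}{|a|\,|b|} = \operatorname{sign}(a\,b) \in \{-1, +1\}
  \qquad (a \neq 0,\; b \neq 0)
```

So with one co-rater the value only records whether the two ratings fall on the same side of that user's mean, and it is undefined when either rating equals the mean.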


1 Answer


The short answer is a parameter sweep: try several values of "minimum users who have rated both restaurants" and measure the outcomes. Requiring more co-raters gives you a more reliable estimate of the similarity between items (restaurants), but fewer item pairs will meet the threshold, so your similarity information will be sparser. That is, you'll focus on the more popular items and be less able to recommend items in the long tail. This means you'll always have a tradeoff, and you should measure everything that helps you make it. For instance, measure predictive accuracy (e.g., RMSE) as well as the number of items it is possible to recommend.
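A sweep like this can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the toy `train` and `test` data, the function names, and the mean-centered weighted-average predictor are all made up for the example and are not taken from your system. It reports RMSE on held-out ratings, the fraction of held-out ratings that could be predicted at all (coverage), and the number of item pairs that received a similarity.

```python
import math
from itertools import combinations

def adjusted_cosine(ratings, i, j, min_overlap):
    """Adjusted cosine between items i and j; None if too few co-raters."""
    num = den_i = den_j = 0.0
    overlap = 0
    for user_ratings in ratings.values():
        if i in user_ratings and j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di, dj = user_ratings[i] - mean, user_ratings[j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
            overlap += 1
    if overlap < min_overlap or den_i == 0 or den_j == 0:
        return None
    return num / math.sqrt(den_i * den_j)

def build_sims(ratings, min_overlap):
    """Similarity for every item pair that meets the co-rater threshold."""
    items = sorted({i for r in ratings.values() for i in r})
    sims = {}
    for i, j in combinations(items, 2):
        s = adjusted_cosine(ratings, i, j, min_overlap)
        if s is not None:
            sims[(i, j)] = sims[(j, i)] = s
    return sims

def predict(ratings, sims, user, item):
    """Mean-centered weighted average over the user's rated items; None if no usable neighbor."""
    rated = ratings.get(user, {})
    if not rated:
        return None
    mean = sum(rated.values()) / len(rated)
    num = den = 0.0
    for j, r in rated.items():
        s = sims.get((item, j))
        if s is not None:
            num += s * (r - mean)
            den += abs(s)
    return None if den == 0 else mean + num / den

def sweep(train, test, thresholds):
    """For each threshold, report RMSE, coverage, and number of item pairs with a similarity."""
    results = {}
    for k in thresholds:
        sims = build_sims(train, k)
        errors = []
        for user, item, actual in test:
            p = predict(train, sims, user, item)
            if p is not None:
                errors.append((p - actual) ** 2)
        results[k] = {
            "rmse": math.sqrt(sum(errors) / len(errors)) if errors else None,
            "coverage": len(errors) / len(test),
            "item_pairs": len(sims) // 2,
        }
    return results

# Hypothetical toy data: {user: {restaurant: rating}} and held-out (user, restaurant, rating).
train = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "D": 5},
    "u4": {"C": 5, "D": 4},
    "u5": {"A": 5, "D": 1},
}
test = [("u4", "A", 2), ("u5", "B", 4), ("u3", "C", 4)]

results = sweep(train, test, thresholds=[1, 2, 3])
for k, metrics in results.items():
    print(k, metrics)
```

On real data you would plot these metrics against the threshold and pick the point where accuracy stops improving enough to justify the lost coverage. Coverage and the number of item pairs can only stay the same or shrink as the threshold rises, which is the tradeoff described above.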

If your item space becomes too sparse, you may want to find other ways to compute item-item similarity beyond user ratings. For instance, you can use content-based filtering methods to include information about each restaurant's cuisine, then add an intermediate step that learns each user's cuisine preferences. That will allow you to make recommendations even when you don't have item-item similarity scores.
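As a rough illustration of that fallback, the sketch below learns a user's preference per cuisine as the average rating they gave to restaurants of that cuisine, then ranks unrated restaurants by it. Everything here is an assumption for the example: the `cuisines` mapping, the single-cuisine-per-restaurant simplification, and the function names are hypothetical, and a real system would likely use richer features.

```python
from collections import defaultdict

def cuisine_preferences(user_ratings, cuisines):
    """Average rating the user gave per cuisine (a deliberately simple preference model)."""
    by_cuisine = defaultdict(list)
    for item, rating in user_ratings.items():
        by_cuisine[cuisines[item]].append(rating)
    return {c: sum(v) / len(v) for c, v in by_cuisine.items()}

def recommend_by_cuisine(user_ratings, cuisines, top_n=3):
    """Rank unrated restaurants by the user's learned preference for their cuisine."""
    prefs = cuisine_preferences(user_ratings, cuisines)
    candidates = [
        (prefs[c], item)
        for item, c in cuisines.items()
        if item not in user_ratings and c in prefs
    ]
    # Highest preference first; ties broken by restaurant name for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return [item for _, item in candidates[:top_n]]

# Hypothetical toy data: one cuisine per restaurant.
cuisines = {"A": "italian", "B": "italian", "C": "sushi", "D": "sushi", "E": "italian"}
user_ratings = {"A": 5, "C": 2}

recommendations = recommend_by_cuisine(user_ratings, cuisines)
print(recommendations)
```

Because this needs no co-raters at all, it can cover long-tail restaurants that the collaborative filter cannot score, at the cost of being much coarser.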