I have been using the Mahout library to implement a recommendation algorithm. I have used the EuclideanDistanceSimilarity class and so far my results seem fine.

My DataModel currently consists of 500 ratings for 100 items, which are rated on a scale of 1 to 5, such as

customer  itemID  rating
1         2       8

However, the Apache Mahout API documentation states: "Note that the distance isn't normalized in any way; it's not valid to compare similarities computed from different domains (different rating scales, for example). Within one domain, normalizing doesn't matter much as it doesn't change ordering."

Will this impact the validity/reliability of my results as I capture more customers and items?

1 Answer

The important part of the passage you quote is the word "domain": as long as the customers and items you add come from the same system (domain) and use the same 5-star rating scale, there is no need to normalize the data.

If you started adding data from other systems that use, say, a 7-star rating, you would have to normalize those ratings to make them comparable with your existing 5-star scale.
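One simple way to do such a normalization is a linear rescaling from the source scale to the target scale. The sketch below is plain Java and does not use any Mahout API; the class name `RatingRescaler` and its parameters are made up for illustration, and linear mapping is only one possible choice.

```java
final class RatingRescaler {

    // Linearly maps a rating from [srcMin, srcMax] onto [dstMin, dstMax].
    // Hypothetical helper for illustration; not part of Mahout.
    static double rescale(double rating, double srcMin, double srcMax,
                          double dstMin, double dstMax) {
        if (srcMax <= srcMin) {
            throw new IllegalArgumentException("srcMax must be greater than srcMin");
        }
        // Position of the rating within the source scale, between 0 and 1.
        double fraction = (rating - srcMin) / (srcMax - srcMin);
        return dstMin + fraction * (dstMax - dstMin);
    }

    public static void main(String[] args) {
        // Print every 7-star rating next to its 5-star equivalent.
        for (int r = 1; r <= 7; r++) {
            System.out.printf("%d -> %.2f%n", r, rescale(r, 1, 7, 1, 5));
        }
    }
}
```

With this mapping the endpoints of the 7-star scale land on the endpoints of the 5-star scale, and the midpoint 4 maps to the midpoint 3.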

Let me add a few more words about rating-based recommendation:

In general, your approach is fine. The main problem with ratings is that users tend to rate quite differently: one user may give a "3" to an item she likes, while another would assign a "5". Some recommenders therefore experiment with other approaches, e.g. converting numeric ratings to boolean preferences, where any rating at all is treated as general "interest" in the item.
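To make the boolean idea concrete, here is a minimal sketch in plain Java (no Mahout classes involved). It assumes each rating is stored as a `{userId, itemId, rating}` row; that layout and the class name `BooleanPreferences` are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class BooleanPreferences {

    // Each row is assumed to be {userId, itemId, rating}.
    // The rating value itself is ignored: the mere existence of a rating
    // is treated as "this user showed interest in this item".
    static Map<Long, Set<Long>> fromRatings(long[][] ratings) {
        Map<Long, Set<Long>> itemsByUser = new HashMap<>();
        for (long[] row : ratings) {
            itemsByUser.computeIfAbsent(row[0], id -> new HashSet<>()).add(row[1]);
        }
        return itemsByUser;
    }
}
```

For example, a user who rated item 10 with a "3" and another who rated it with a "5" both end up with item 10 in their set, so the difference in personal rating habits no longer matters.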

I highly recommend capturing a few statistics about your dataset, e.g. the average rating. If it is close to 1 or 5 (meaning most ratings are either very bad or very good rather than evenly distributed), you might want to give the boolean approach a try.
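A rough sketch of how such statistics could be computed, again in plain Java without Mahout. The class name `RatingStats` and the integer 1 to 5 scale are assumptions; how close to 1 or 5 counts as "skewed" is a judgment call that the code does not make for you.

```java
final class RatingStats {

    // Arithmetic mean of all ratings.
    static double mean(int[] ratings) {
        if (ratings.length == 0) {
            throw new IllegalArgumentException("no ratings given");
        }
        long sum = 0;
        for (int r : ratings) {
            sum += r;
        }
        return (double) sum / ratings.length;
    }

    // Counts how often each rating value between min and max (inclusive) occurs.
    // counts[0] holds the number of ratings equal to min, and so on.
    static int[] histogram(int[] ratings, int min, int max) {
        int[] counts = new int[max - min + 1];
        for (int r : ratings) {
            if (r < min || r > max) {
                throw new IllegalArgumentException("rating out of range: " + r);
            }
            counts[r - min]++;
        }
        return counts;
    }
}
```

Looking at the histogram alongside the mean helps, because a mean near the middle of the scale can still hide a distribution where most ratings sit at the two extremes.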