I'm developing an item-based collaborative filter that uses an adjusted cosine similarity between restaurants to generate recommendations. I have everything set up and it works well, but when I simulated some test scenarios, I got some interesting results.

I'll start with my test data. I have 2 restaurants whose similarity I want to calculate, and 3 users who have each given both restaurants the same rating. I'll explain it using the following matrix:

               User 1 | User 2 | User 3
Restaurant 1 |   1    |   2    |   1
Restaurant 2 |   1    |   2    |   1

I'm trying to calculate the similarity using the following function (restaurants are called subjects in my code):

public double ComputeSimilarity(Guid subject1, Guid subject2, IEnumerable<Review> allReviews)
{
    //This will create an IEnumerable of reviews from the same user on the 2 restaurants.
    var matches = (from R1 in allReviews.Where(x => x.SubjectId == subject1)
                   from R2 in allReviews.Where(x => x.SubjectId == subject2)
                   where R1.UserId == R2.UserId
                   select new { R1, R2 });
    double num = 0.0;
    double dem1 = 0.0;
    double dem2 = 0.0;
    //For the similarity between subjects, we use an adjusted cosine similarity.
    //More information on this can be found here: http://www10.org/cdrom/papers/519/node14.html
    foreach (var item in matches)
    {
        //First get the average of all reviews the user has given. This is used in the adjusted cosine similarity, read the article from the link for further explanation
        double avg = allReviews.Where(x => x.UserId == item.R1.UserId)
                               .Average(x => x.rating);
        num += ((item.R1.rating - avg) * (item.R2.rating - avg));
        dem1 += Math.Pow((item.R1.rating - avg), 2);
        dem2 += Math.Pow((item.R2.rating - avg), 2);
    }
    return (num / (Math.Sqrt(dem1) * Math.Sqrt(dem2)));
}

My review looks like this:

public class Review
{
    public Guid Id { get; set; }
    public int rating { get; set; } //This can be an integer between 1-5
    public Guid SubjectId { get; set; } //This is the guid of the subject the review has been left on
    public Guid UserId { get; set; } //This is the guid of the user who left the review
}

In all other scenarios the function calculates a correct similarity between subjects. But when I use the test data above (where I expected a perfect similarity), it results in NaN.

Is this an error in my code or a limitation of the adjusted cosine similarity? And if it results in NaN, is it reasonable to catch this and use 1 as the similarity?

Edit: I have tried with other matrices too, and I got even more interesting results.

               User 1 | User 2 | User 3 | User 4 | User 5
Restaurant 1 |   1    |   2    |   1    |   1    |   2
Restaurant 2 |   1    |   2    |   1    |   1    |   2

This still results in NaN.

               User 1 | User 2 | User 3 | User 4 | User 5
Restaurant 1 |   2    |   2    |   1    |   1    |   2
Restaurant 2 |   1    |   2    |   1    |   1    |   2

This results in a similarity of -1.

Well, that's just a property of the formula you use: when every user gives the same rating to all the restaurants they rated (so, identical rows in your table above), the denominator is zero and the result is undefined (represented by NaN in .NET). - Evk
Is this preventable? I assume this scenario is very unlikely, but it is possible. - user4189129
Since you know when this situation arises, you can handle it as a special case rather than trying to prevent it. - Evk
By the way, shouldn't you compute two averages, one for item.R1.UserId and one for item.R2.UserId? Right now you compute the average for the first user and use it in all calculations, even those related to the second user. - Evk
Yes, my bad, of course it's the same user. What I think is: if the similarity is undefined by this measure, it's not correct to assign 0 or 1 or any other value to it. Instead, you may try another measure in this case. For your example, simple cosine similarity (not adjusted) will give 1, which is what you would expect. - Evk
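
To make the zero denominator concrete, here is the adjusted cosine similarity written out and evaluated by hand for the first and last test matrices from the question. This is only a sketch: it assumes each user has reviewed nothing other than these two restaurants, so that a user's mean rating is the mean of the two values in their column. Here U is the set of users who rated both items i and j, and \bar{R}_u is the mean of all ratings given by user u.

```latex
\[
\operatorname{sim}(i,j) =
\frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}
     {\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\;
      \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
\]

% First matrix: every user gave both restaurants the same rating,
% so \bar{R}_u = R_{u,1} = R_{u,2} and every deviation is zero.
\[
\operatorname{sim}(1,2) = \frac{0 + 0 + 0}{\sqrt{0}\,\sqrt{0}} = \frac{0}{0}
\quad \text{(undefined, NaN in .NET)}
\]

% Last matrix: only user 1 differs (ratings 2 and 1, mean 1.5);
% all other users contribute zero to every sum.
\[
\operatorname{sim}(1,2) =
\frac{(0.5)(-0.5)}{\sqrt{0.5^2}\,\sqrt{(-0.5)^2}}
= \frac{-0.25}{0.25} = -1
\]
```

Under that assumption, the -1 comes entirely from the single user whose two ratings differ, because the mean-centering removes every user whose ratings are all equal.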

1 Answer

It seems your algorithm is implemented correctly. The formula itself can be undefined for perfectly reasonable inputs. You can read this case as "this measure (adjusted cosine similarity) has nothing to say about the provided ratings", so it is not correct to assign an arbitrary value (0, 1, or -1). Instead, use a different measure in this case. For example, simple (non-adjusted) cosine similarity gives 1 here, which is what you would expect.
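
One way to apply this suggestion is sketched below. It computes the adjusted cosine similarity and switches to plain cosine similarity only when the adjusted denominator is zero. The class name SimilarityWithFallback and the method Compute are hypothetical names introduced for illustration, and the Review class is a trimmed copy of the one in the question. Each square-root product is folded into a single Math.Sqrt call, which is mathematically equivalent to the original expression.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed stand-in for the Review class from the question.
public class Review
{
    public int rating { get; set; }
    public Guid SubjectId { get; set; }
    public Guid UserId { get; set; }
}

// Hypothetical helper, not part of the original code.
public static class SimilarityWithFallback
{
    public static double Compute(Guid subject1, Guid subject2,
                                 IReadOnlyCollection<Review> allReviews)
    {
        // Pair up reviews left by the same user on both subjects.
        var matches = (from r1 in allReviews.Where(x => x.SubjectId == subject1)
                       join r2 in allReviews.Where(x => x.SubjectId == subject2)
                           on r1.UserId equals r2.UserId
                       select new { r1, r2 }).ToList();

        // Mean rating per user over everything that user reviewed.
        var userAvg = allReviews
            .GroupBy(x => x.UserId)
            .ToDictionary(g => g.Key, g => g.Average(x => x.rating));

        double num = 0, dem1 = 0, dem2 = 0;     // adjusted cosine terms
        double rawNum = 0, raw1 = 0, raw2 = 0;  // plain cosine terms
        foreach (var m in matches)
        {
            double avg = userAvg[m.r1.UserId];
            double d1 = m.r1.rating - avg;
            double d2 = m.r2.rating - avg;
            num += d1 * d2;
            dem1 += d1 * d1;
            dem2 += d2 * d2;
            rawNum += m.r1.rating * m.r2.rating;
            raw1 += m.r1.rating * m.r1.rating;
            raw2 += m.r2.rating * m.r2.rating;
        }

        // Adjusted cosine similarity, when it is defined.
        double denom = Math.Sqrt(dem1 * dem2);
        if (denom > 0)
            return num / denom;

        // Fallback: plain cosine similarity on the raw ratings.
        // Still NaN when there are no co-rated reviews at all.
        double rawDenom = Math.Sqrt(raw1 * raw2);
        return rawDenom > 0 ? rawNum / rawDenom : double.NaN;
    }
}
```

With the first test matrix from the question (and no other reviews by those users), the adjusted denominator is zero, so the fallback runs and should return 1. The last matrix still goes through the adjusted formula and returns -1. Whether mixing two measures in one ranking is acceptable depends on how the similarities are used downstream, so treat this as a starting point rather than a recommendation.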