Mahout implements item-based collaborative filtering in a job called itemsimilarity.

In theory, the similarity between two items should be calculated only from users who rated both items. During testing I realized that Mahout works differently.

In the example below, the similarity between items 11 and 12 should be 1, but Mahout outputs 0.36.

Example 1. Items are 11-12.

Similarity between items:

101     102     0.36602540378443865

Matrix with preferences (rows are users, columns are items, - means no rating):

user  11  12
1     -   1
2     -   1
3     1   1
4     -   1

It looks like Mahout treats a missing preference as 0.
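
To illustrate that guess: the reported 0.36602540378443865 happens to equal 1 / (1 + sqrt(3)), which is what a Euclidean-distance-based similarity gives when the missing ratings for item 11 are counted as 0. The sketch below is plain Python, not Mahout code. The formula and the function name are assumptions, since the question does not say which similarity measure the run used.

```python
import math

def euclidean_similarity_zero_filled(a, b):
    """Assumed formula: 1 / (1 + Euclidean distance), missing ratings (None) counted as 0."""
    dist = math.sqrt(sum(((x or 0.0) - (y or 0.0)) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

# Ratings from Example 1, one entry per user 1..4; None means no rating.
item_11 = [None, None, 1.0, None]
item_12 = [1.0, 1.0, 1.0, 1.0]

sim_example_1 = euclidean_similarity_zero_filled(item_11, item_12)
print(sim_example_1)  # about 0.366, matching the value in the output above
```

If only user 3 (the one user who rated both items) were used, the distance would be 0 and the same formula would give 1.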

Example 2. Items are 101-103.

Similarity between items:

101     102     0.2612038749637414
101     103     0.4340578302732228
102     103     0.2600070276638468

Matrix with preferences (rows are users, columns are items, - means no rating):

user  101  102  103
1     -    1    0.1
2     -    1    0.1
3     -    1    0.1
4     1    1    0.1
5     1    1    0.1
6     -    1    0.1
7     -    1    0.1
8     -    1    0.1
9     -    1    0.1
10    -    1    0.1

According to the theory, the similarity between items 101 and 102 should be calculated using only the ratings of users 4 and 5, and likewise for items 101 and 103. In Mahout's output, (101,103) is more similar than (101,102), which it shouldn't be.
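
A minimal sketch of the difference, in plain Python rather than Mahout: it assumes a similarity of 1 / (1 + Euclidean distance), which is not stated anywhere in the question but reproduces the three values above when missing ratings are zero-filled. The `similarity` helper and its `co_rated_only` flag are hypothetical names used only for illustration.

```python
import math

# Ratings from Example 2: item -> {user: rating}; absent users gave no rating.
ratings = {
    101: {4: 1.0, 5: 1.0},
    102: {u: 1.0 for u in range(1, 11)},
    103: {u: 0.1 for u in range(1, 11)},
}

def similarity(a, b, co_rated_only):
    """Assumed formula: 1 / (1 + Euclidean distance) between two items' ratings."""
    # Either restrict to users who rated both items, or take every user and
    # count a missing rating as 0.
    users = (a.keys() & b.keys()) if co_rated_only else (a.keys() | b.keys())
    dist = math.sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users))
    return 1.0 / (1.0 + dist)

for i, j in [(101, 102), (101, 103), (102, 103)]:
    zero_filled = similarity(ratings[i], ratings[j], co_rated_only=False)
    co_rated = similarity(ratings[i], ratings[j], co_rated_only=True)
    print(i, j, zero_filled, co_rated)
```

With `co_rated_only=True`, (101,102) comes out as 1.0 and (101,103) falls below it, which is the ordering expected from the theory.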

Both examples were run without any additional parameters.

Has this problem been solved somewhere? Any ideas?

Source: http://files.grouplens.org/papers/www10_sarwar.pdf

1 Answer

Those users are not identical. Collaborative filtering needs a measure of cooccurrence, and the same items do not cooccur between those users. Likewise, the items are not identical: each has different users who preferred it.

The data is turned into a "sparse matrix" where only non-zero values are recorded. The rest are treated as 0, which is expected and correct. The algorithms treat 0 as no preference, not as a negative preference.
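
As a rough illustration of the sparse storage described here (plain Python, not Mahout's actual data structures; the `rating` helper is a hypothetical name), only the stored ratings exist and every other user reads back as 0:

```python
# Sparse row for item 101 from Example 2: only stored ratings, keyed by user id.
item_101 = {4: 1.0, 5: 1.0}

def rating(sparse_row, user):
    """Look up a rating; users with no stored entry read back as 0.0."""
    return sparse_row.get(user, 0.0)

# Expanding the sparse row over users 1..10 fills the gaps with zeros.
dense_101 = [rating(item_101, u) for u in range(1, 11)]
print(dense_101)
```

Once expanded this way, a user who never rated the item looks the same as a stored value of 0.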

It's doing the right thing.