I am running collaborative filtering in SparkML on an implicit feedback dataset.

Let's say my dataset looks like the one below.

User    Item    Viewed
1       A       1
1       B       2
2       A       3
2       C       4
3       A       3
3       B       2
3       D       6

I have around 56K unique users and 8.5K unique items. However, each user doesn't have a row for every item, only for the items they have rated/viewed. It's an implicit feedback dataset where the Viewed column is the number of times a user has viewed an item.

Now this is also the format SparkML expects (userid, itemid, rating).

However, my question is: can I feed this dataset as is to the SparkML ALS algorithm, or do I need to create a cartesian join of all users and items? My reasoning is that since not every combination of user and item is in this dataset, the ALS algorithm will not see all combinations of user and item and hence will give Null values in predictions for those.

So for items which a user has not seen yet, should we create a row for each user as well and set the view count to 0, as below?

User    Item    Viewed
1       A       1
1       B       2
1       C       0
1       D       0
2       A       3
2       C       4
2       B       0
2       D       0
3       A       3
3       B       2
3       D       6
3       C       0

If this is correct, then with 56K unique users and 8.5K unique items, that would make 56K * 8.5K = about 476MM rows.

Imagine if there were millions of users and millions of items. In that case it would be a huge dataset.

I did the cartesian join and it seems to give correct predictions, with none of the Null values I got earlier. But I want to confirm whether this is how the dataset needs to be prepared for Spark collaborative filtering.
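As a quick check of the arithmetic, here is a small sketch in plain Python (not Spark). The helper name `cartesian_rows` is made up for illustration; the toy rows are the ones from the first table, and the "millions by millions" case is a hypothetical size, not a measured one.

```python
# Observed (user, item, views) rows from the toy example in the question.
observed = [
    (1, "A", 1), (1, "B", 2),
    (2, "A", 3), (2, "C", 4),
    (3, "A", 3), (3, "B", 2), (3, "D", 6),
]


def cartesian_rows(n_users, n_items):
    """Rows needed if every (user, item) pair gets its own row."""
    return n_users * n_items


toy_users = {user for user, _, _ in observed}
toy_items = {item for _, item, _ in observed}
toy_dense = cartesian_rows(len(toy_users), len(toy_items))

print(len(observed), "observed rows vs", toy_dense, "cartesian rows")
print(f"{cartesian_rows(56_000, 8_500):,}")         # 56K users x 8.5K items
print(f"{cartesian_rows(1_000_000, 1_000_000):,}")  # hypothetical larger case
```

The dense table grows with the product of users and items, while the observed table grows only with the number of actual views.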

Am I correct here?

EDIT:

The other question asked how to create a cartesian join, not whether a cartesian join is the correct dataset format for Spark ML. So it's a different question. Please don't close.

Hey @baktaawar, did you find a clear answer to your question? I would be quite interested! - Duesentrieb
@Duesentrieb Well, not exactly, but I think what I have written here is correct. For implicit feedback you would have to do a cartesian join unless you are OK with not having some recommendations. I had a long email discussion about this with Sean Owen from Cloudera, and he later agreed that what I said makes sense. I think this is something the Spark people have not realized, and it is probably something like a bug. - Baktaawar

1 Answer

This assumption is clearly wrong:

My reasoning is that since not every combination of user and item is in this dataset, the ALS algorithm will not see all combinations of user and item and hence will give Null values in predictions for those.

and makes this question invalid. There is no need for every combination of user and item. All you need is some data for each item and each user. Intuitively, if you haven't seen a user or an item, it won't be present in the computed factors, and you cannot reason about it. That's it.
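To illustrate, here is a toy sketch in plain Python. It is not Spark's implementation, and the factor values are made up; in a real run ALS would learn one factor vector per user and per item that appears in the training data. The point it shows is that a score for a pair is a dot product of two vectors, so the pair itself never has to appear in the input.

```python
# Hypothetical learned factors: one vector per user and per item that
# appeared at least once in the training data (values are invented).
user_factors = {1: [1.0, 0.5], 2: [0.5, 1.0], 3: [1.0, 1.0]}
item_factors = {
    "A": [0.8, 0.6],
    "B": [0.9, 0.1],
    "C": [0.2, 0.4],
    "D": [0.5, 0.5],
}


def predict(user, item):
    """Score a (user, item) pair as the dot product of their factors.

    Returns None when the user or the item has no factor vector,
    i.e. it never appeared in the training data.
    """
    u = user_factors.get(user)
    v = item_factors.get(item)
    if u is None or v is None:
        return None
    return sum(a * b for a, b in zip(u, v))


# User 1 never viewed item C, yet both have factors, so a score exists.
print(predict(1, "C"))
# User 4 and item E were never seen at all, so there is nothing to score.
print(predict(4, "A"))
print(predict(1, "E"))
```

Only the last two calls come back empty, and no amount of zero-filled rows for existing users and items would change that.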

About this:

So for items which a user has not seen yet, should we create a row for each user as well and set the view count to 0, as below?

This might work to some extent with implicit feedback, but with explicit feedback it is clearly wrong: a missing rating is not the same as the lowest possible rating.
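A rough way to see the difference is sketched below. It assumes the usual preference/confidence formulation for implicit feedback (from Hu, Koren and Volinsky's "Collaborative Filtering for Implicit Feedback Datasets", which the Spark documentation cites for its implicit mode). The function name is hypothetical and `alpha` is a tunable parameter; this is an illustration of the idea, not Spark's code.

```python
def preference_and_confidence(views, alpha=1.0):
    """Map a raw view count to (preference, confidence).

    preference is 1 if the item was viewed at all, otherwise 0;
    confidence grows with the view count, starting from 1.
    """
    preference = 1.0 if views > 0 else 0.0
    confidence = 1.0 + alpha * views
    return preference, confidence


# Observed view counts from the toy dataset in the question.
observed = {
    (1, "A"): 1, (1, "B"): 2,
    (2, "A"): 3, (2, "C"): 4,
    (3, "A"): 3, (3, "B"): 2, (3, "D"): 6,
}

# A pair with no row is read as zero views ...
missing = preference_and_confidence(observed.get((1, "C"), 0))
# ... which is exactly what an explicit zero-filled row would give.
zero_filled = preference_and_confidence(0)
print(missing, zero_filled)
print(preference_and_confidence(observed[(3, "D")]))
```

Under this formulation an absent pair already counts as "no preference, low confidence", so a zero-filled row adds little. With explicit ratings there is no such reading: a 0 would be fitted as an actual (very low) rating.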