I am running collaborative filtering in Spark ML on an implicit feedback dataset.
Let's say my dataset looks like this:
User Item Viewed
1 A 1
1 B 2
2 A 3
2 C 4
3 A 3
3 B 2
3 D 6
I have around 56K unique users and 8.5K unique items. However, a user does not have a row for every item, only for the items they have viewed. It is an implicit feedback dataset, where the Viewed column is the number of times a user has viewed an item.
This is also the format Spark ML expects: (userid, itemid, rating).
My question is: can I feed this dataset to the Spark ML ALS algorithm as it is, or do I need to create a cartesian join of all users and items? My reasoning is that, since the dataset does not contain every user-item combination, ALS will not see all combinations and will therefore give Null values in the predictions for the missing ones.
So, for items that a user has not seen yet, should we also create a row for each user and set the view count to 0, as below?
User Item Viewed
1 A 1
1 B 2
1 C 0
1 D 0
2 A 3
2 C 4
2 B 0
2 D 0
3 A 3
3 B 2
3 D 6
3 C 0
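To make the expansion above concrete, here is a minimal plain-Python sketch (not Spark code) of how the zero-filled table can be derived from the sparse one. The names `views`, `users`, `items`, and `dense` are only illustrative, and the row order it prints may differ from the table above.

```python
from itertools import product

# Sparse interactions from the first table: (user, item) -> view count
views = {
    (1, "A"): 1, (1, "B"): 2,
    (2, "A"): 3, (2, "C"): 4,
    (3, "A"): 3, (3, "B"): 2, (3, "D"): 6,
}

# Distinct users and items that appear anywhere in the data
users = sorted({u for u, _ in views})
items = sorted({i for _, i in views})

# Cartesian product of users and items; pairs never viewed get a count of 0
dense = [(u, i, views.get((u, i), 0)) for u, i in product(users, items)]

for row in dense:
    print(*row)
```

With 3 users and 4 items this gives 12 rows, 5 of which are the added zero rows.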
If this is correct, then with 56K unique users and 8.5K unique items, that would make 56K * 8.5K = about 476 million rows.
Imagine if there were millions of users and millions of items. In that case it would be a huge dataset.
I tried the cartesian join, and it seems to give correct predictions with no Null values, unlike before. But I want to confirm whether this is how the dataset needs to be prepared for Spark collaborative filtering.
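As a quick check on the sizes involved, here is a small sketch of the row-count arithmetic, taking the rounded counts 56,000 and 8,500 at face value. The helper name `cartesian_rows` is just for illustration.

```python
def cartesian_rows(n_users: int, n_items: int) -> int:
    """Number of rows in a full user x item cartesian join."""
    return n_users * n_items

# Counts from this dataset (rounded)
print(f"{cartesian_rows(56_000, 8_500):,}")

# Hypothetical case: one million users and one million items
print(f"{cartesian_rows(1_000_000, 1_000_000):,}")
```

The first case comes to 476,000,000 rows, and the hypothetical million-by-million case to 10^12 rows, which is why I am worried about this approach.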
Am I correct here?
EDIT:
The other question asks how to create a cartesian join, not whether a cartesian join is the correct dataset format for Spark ML, so this is a different question. Please don't close.