3
votes

I have a boolean preference recommender based on user similarity. My data set essentially contains user/item relations where each ItemId is an article the user has decided to read. I'd like to add a second data model containing relations where each ItemId is a subscription to a particular topic.

The only way I can imagine doing this is by merging the two together, offsetting the subscription IDs so that they don't collide with the article IDs. For weighting I considered dropping the boolean preference setup and introducing preference scores, where the articles subset has a preference score of 1 (for example) and the subscriptions subset has a preference score of 2.
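A minimal sketch of the merge I have in mind, assuming both data sets are plain lists of (user_id, item_id) pairs with non-negative integer IDs. The function name `merge_interactions` and the weights 1 and 2 are just placeholders for illustration, not anything from Mahout:

```python
def merge_interactions(article_reads, subscriptions,
                       read_weight=1.0, subscription_weight=2.0):
    """Merge (user_id, item_id) pairs from two sources into weighted triples."""
    # Shift subscription IDs past the largest article ID so the two
    # ID spaces cannot collide.
    offset = max((item for _, item in article_reads), default=-1) + 1
    merged = [(user, item, read_weight) for user, item in article_reads]
    merged += [(user, item + offset, subscription_weight)
               for user, item in subscriptions]
    return merged, offset


# Hypothetical sample data: user 1 read articles 10 and 11, and so on.
reads = [(1, 10), (1, 11), (2, 10)]
subs = [(1, 0), (2, 3)]
merged, offset = merge_interactions(reads, subs)
```

Returning the offset alongside the merged list makes it possible to map a recommended ItemId back to a subscription ID later.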

I'm not sure if this would work, however, because a preference score isn't exactly analogous to the sort of weighting I'm after; preference scores probably carry some notion that a lower score represents dissatisfaction.

I have to imagine there's a better way to do this or at least that there are tweaks to my plan which would make it work more along the lines I desire.

1 Answer

4
votes

I think you're thinking about it the right way. Yes, you want a bit more expressiveness than a simple exists/doesn't-exist flag for subscriptions and articles, since they mean somewhat different things. I would suggest picking weights that reflect their relative frequency. For example, if users have read 100,000 articles over all time and made 10,000 subscriptions, then you might pick a subscription weight of 10 and a read weight of 1.
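As a rough sketch of that arithmetic (the helper name `frequency_weights` is made up for illustration, and taking a simple ratio of counts is only one reasonable reading of "relative frequency"):

```python
def frequency_weights(event_counts):
    """Weight each event type by how rare it is relative to the most common type.

    Assumes every count is a positive number.
    """
    most_common = max(event_counts.values())
    return {kind: most_common / count for kind, count in event_counts.items()}


# Counts taken from the example above.
weights = frequency_weights({"read": 100_000, "subscription": 10_000})
print(weights)
```

The most common event type always gets weight 1, and rarer types get proportionally larger weights.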

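This doesn't quite work if you treat those values as preference scores, for a number of reasons; one is that rating-based algorithms tend to read a lower score as less liking rather than as weaker evidence. It works better if you use an approach that treats them like what they are, which is linear weights.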

I would point you to the ALS-WR algorithm, which is specifically designed for this type of input. See, for example, the paper "Collaborative Filtering for Implicit Feedback Datasets".
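To show where a linear weight enters, here is the objective from that paper as I understand it. The symbols are the paper's, not Mahout's: r_ui is the raw weight for user u and item i (for instance 1 for a read, 10 for a subscription), x_u and y_i are the user and item factor vectors, and alpha and lambda are tuning parameters.

```latex
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}

\min_{x_*, y_*} \;
\sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

The weight only scales the confidence c_ui in an observation; the target p_ui stays 0 or 1, so a small weight is never read as dislike.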

This is implemented in Mahout as ParallelALSFactorizationJob on Hadoop. It works nicely, though it requires Hadoop. (I can't take credit for that, though I did write most of the recommender code in Mahout.)

Advertisement: I'm working on commercializing a "next generation" system, evolved from my work in Mahout, as Myrrix. It is an implementation of ALS-WR and is ideal for your kind of input. It's quite easy to download and run, and doesn't need Hadoop.

Given that it may be directly suitable for your problem, I don't mind plugging it here.