I have a pretty standard Mahout item-based recommender for news articles (using click data, so preferences are Boolean):
// Boolean (click) preferences, cached in memory from Postgres
DataModel dataModel = new ReloadFromJDBCDataModel(
    new PostgreSQLBooleanPrefJDBCDataModel(localDB, ...)
);
// Tanimoto coefficient suits binary preference data
ItemSimilarity itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
ItemBasedRecommender recommender =
    new GenericBooleanPrefItemBasedRecommender(dataModel, itemSimilarity);
I am experimenting with injecting content-based knowledge into the recommender, so that the articles recommended most highly are similar not only in the usual collaborative-filtering sense, but also in content, i.e. they share many common terms.
The article content similarities (cosine similarity of TF-IDF vectors) are precomputed in a Mahout batch job and read from a database. However, there will be many pairs of articles for which there is no similarity data, for two reasons:
- The content similarity data will be updated less often than the user-item preference data model, so there will be a lag before new articles have their content similarities calculated.
- Ideally I would like to hold all content similarity data in memory, so I will only store the top 20 similarities for each article (see the sketch just below); any pair outside that top 20 has no stored similarity.
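For reference, here is a minimal sketch of the in-memory store I have in mind (ContentSimilarityStore and its methods are hypothetical names of mine; FastByIDMap is Mahout's primitive-keyed map):

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;

// Hypothetical in-memory store: article ID -> (neighbour article ID -> cosine similarity).
// Only the top 20 neighbours per article are kept, so lookups can return null.
public class ContentSimilarityStore {

    private final FastByIDMap<FastByIDMap<Double>> topSimilarities = new FastByIDMap<>();

    public void put(long itemID, long otherItemID, double cosine) {
        FastByIDMap<Double> neighbours = topSimilarities.get(itemID);
        if (neighbours == null) {
            neighbours = new FastByIDMap<>();
            topSimilarities.put(itemID, neighbours);
        }
        neighbours.put(otherItemID, cosine);
    }

    // Returns null when the pair is not among the stored top-20 similarities.
    public Double similarity(long itemID1, long itemID2) {
        FastByIDMap<Double> neighbours = topSimilarities.get(itemID1);
        return neighbours == null ? null : neighbours.get(itemID2);
    }
}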
So, for a given pair of articles, I have:
- The item similarity (Tanimoto): 0 <= s1 <= 1
- The content similarity (cosine): 0 <= s2 <= 1 (may be null)
Where the content similarity is not null, I want to use its value to weight the item similarity, in order to boost articles with similar content.
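For example, two candidate formulas I have been considering (alpha is a tuning weight in [0, 1] that I would pick empirically; both fall back to plain s1 when s2 is null):

s = (1 - alpha) * s1 + alpha * s2    (linear blend, stays in [0, 1])
s = s1 * (1 + alpha * s2)            (multiplicative boost; can exceed 1, so may need capping)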
My questions are:
- Is it reasonable to try to combine these measures, or am I attempting something crazy?
- What is a sensible formula to combine these 2 values into one similarity score?
- Is this best implemented as a custom ItemSimilarity or as a Rescorer?
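For context, here is a minimal sketch of the wrapping ItemSimilarity I have in mind, assuming the hypothetical ContentSimilarityStore from above and the linear-blend formula with a tuning weight alpha:

import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// Sketch only: blends collaborative similarity with precomputed content similarity.
public class BlendedItemSimilarity implements ItemSimilarity {

    private final ItemSimilarity delegate;             // e.g. TanimotoCoefficientSimilarity
    private final ContentSimilarityStore contentStore; // hypothetical store from above
    private final double alpha;                        // content weight in [0, 1]

    public BlendedItemSimilarity(ItemSimilarity delegate,
                                 ContentSimilarityStore contentStore,
                                 double alpha) {
        this.delegate = delegate;
        this.contentStore = contentStore;
        this.alpha = alpha;
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        double s1 = delegate.itemSimilarity(itemID1, itemID2);
        Double s2 = contentStore.similarity(itemID1, itemID2);
        if (s2 == null) {
            return s1; // no content data: fall back to pure collaborative similarity
        }
        return (1.0 - alpha) * s1 + alpha * s2; // linear blend from above
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        return delegate.allSimilarItemIDs(itemID);
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        delegate.refresh(alreadyRefreshed); // content store is reloaded by the batch job
    }
}

It would then drop in where the plain Tanimoto similarity is used now:

ItemSimilarity blended = new BlendedItemSimilarity(
    new TanimotoCoefficientSimilarity(dataModel), contentStore, 0.3);
ItemBasedRecommender recommender =
    new GenericBooleanPrefItemBasedRecommender(dataModel, blended);

My current thinking is that wrapping the ItemSimilarity lets the content signal influence which items count as neighbours at all, whereas a Rescorer would only adjust scores after the candidates have been chosen, but I may be misreading that tradeoff.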