We were trying to build an online recommender (user-user collaborative filtering) using cosine similarity, with the data stored in Neo4j.
**One difference is that the input data set is a boolean preference (as opposed to a rating)** for 1 million users × ~700 products. For example, with columns User_ID, Product_ID, Preference, a row looks like 11,48989399,1.
I created nodes for users and products, with an index on id (user_id, product_id).
I tried writing a Cypher query to get the top 20 closest neighbours based on the formula:
Similarity = (# of products liked by both users) / (sqrt(# of products liked by user1) * sqrt(# of products liked by user2))
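To make the formula concrete, here is a minimal Python sketch on toy data. The user IDs, product IDs and the helper names `cosine_similarity` and `top_neighbours` are made up for illustration and are not part of our actual setup. With boolean preferences each user reduces to a set of liked products, so the similarity is the size of the overlap divided by the product of the square roots of the two set sizes.

```python
from math import sqrt

def cosine_similarity(liked_a, liked_b):
    """Cosine similarity for boolean preferences, given two sets of product IDs."""
    if not liked_a or not liked_b:
        return 0.0  # a user with no likes has no meaningful similarity
    shared = len(liked_a & liked_b)
    return shared / (sqrt(len(liked_a)) * sqrt(len(liked_b)))

def top_neighbours(target, likes, k=20):
    """Return up to k (user, similarity) pairs, highest similarity first.

    Users sharing no product with the target are skipped, since a graph
    traversal through shared products would never reach them either.
    """
    scores = []
    for other, products in likes.items():
        if other == target:
            continue
        sim = cosine_similarity(likes[target], products)
        if sim > 0:
            scores.append((other, sim))
    # Sort by similarity descending, breaking ties by user ID for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

# Hypothetical toy data: user ID -> set of liked product IDs.
likes = {
    1: {10, 20, 30, 40},
    2: {10, 20},
    3: {30},
    4: {50},
}

for user, sim in top_neighbours(1, likes):
    print(user, round(sim, 4))
```

In this toy data, user 2 shares two of user 1's four products and user 3 shares one, while user 4 shares none and is therefore left out of the result.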
Below is the query:

    MATCH (a:Users)-[d]->() USING INDEX a:Users(id) WHERE a.id = 1
    WITH a.id AS user1, count(d) AS user1_prod
    MATCH (a:Users)-[]->()<-[dd]-others USING INDEX a:Users(id) WHERE a.id = 1
    WITH user1, user1_prod, others, count(dd) AS intersect
    MATCH others-[b1]->()
    WITH user1, others.id AS user2, intersect, user1_prod, count(b1) AS user2_prod
    WITH user1, user2, intersect / (sqrt(user1_prod) * sqrt(user2_prod)) AS similarity
    RETURN user2, similarity
    ORDER BY similarity DESC
    LIMIT 20;

The query returns results in close to 22 seconds, after which the product recommendation step is scalable and fast.
Is there a better way to write the Cypher for the similarity calculation, given that the graph might become denser in future scenarios?
Details: Kernel version Neo4j - Graph Database Kernel (neo4j-kernel), version: 2.1.6
772 772 nodes
neostore.relationshipstore.db.mapped_memory 3078M
CentOS release 6.6 (Final)