I am trying to understand the performance of neo4j in real-time recommendation systems.
The following is a Cypher query (taken from their sandbox) which computes the top 100 users most similar (by cosine similarity) to the query user "Cynthia Freeman":
MATCH (p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
WITH p1, p2,
     COUNT(m) AS numberMovies,
     SUM(x.rating * y.rating) AS xyDotProduct,
     SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
     SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength
WHERE numberMovies > 10
RETURN p1.name, p2.name, xyDotProduct / (xLength * yLength) AS sim
ORDER BY sim DESC
LIMIT 100;
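If I read the query correctly, for each candidate user p2 it computes

sim(p1, p2) = (x · y) / (|x| * |y|)

where x and y are the two users' rating vectors restricted to the movies both of them have rated.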
If my understanding is correct, there is no magic behind the LIMIT clause: the similarity still has to be computed against every user who has rated at least one movie in common with the query user before the top 100 can be selected, so resolving this query in real time seems like a stretch, unless neo4j is doing something clever behind the scenes.
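One way I tried to sanity-check this (assuming the same User/RATED schema as above) is to count how many candidate users the MATCH produces, and to prefix the query with PROFILE to see how many rows flow through each operator:

PROFILE
MATCH (p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
RETURN COUNT(DISTINCT p2) AS candidateUsers;

As far as I can tell from the plan, the Limit operator only sits after the Sort, which would mean every candidate pair is scored first regardless of the LIMIT.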
In another example, they pre-compute a [:SIMILARITY] relationship between user nodes and store it in the graph, so that retrieving the top-N most similar users becomes a matter of sorting those precomputed relationships. Intuitively, though, this makes the graph dense, so there is no storage advantage over simply keeping a similarity matrix.
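For reference, I assume the lookup against those precomputed relationships looks roughly like this (the property name similarity is my guess, not copied from their example):

MATCH (p1:User {name: "Cynthia Freeman"})-[s:SIMILARITY]-(p2:User)
RETURN p2.name, s.similarity AS sim
ORDER BY sim DESC
LIMIT 100;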
Am I missing something fundamental about the way graph databases (and neo4j in particular) work? How can this scale to real-time applications, where there can be tens of thousands of users, and even more products that they interact with?