
I am trying to understand the performance of neo4j in real-time recommendation systems.

The following is a cypher query (taken from their sandbox) which computes top 100 most similar users (in cosine-distance) to the query user "Cynthia Freeman":

MATCH 
    (p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
WITH 
    COUNT(m) AS numberMovies, 
    SUM(x.rating * y.rating) AS xyDotProduct,
    SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
    SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
    p1, p2 
WHERE
    numberMovies > 10
RETURN 
    p1.name, p2.name, xyDotProduct / (xLength * yLength) AS sim
ORDER BY sim DESC
LIMIT 100;

If my understanding is correct, there is no magic behind the LIMIT clause: the similarity still has to be computed against every other user who co-rated a movie with her before the top 100 can be returned, so resolving this query in real time seems like a bit of a stretch, unless neo4j is doing something behind the scenes.
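For what it's worth, profiling just the traversal (a minimal sketch, not part of the sandbox example) shows how many candidate users get expanded before any ORDER BY or LIMIT is applied:

PROFILE
MATCH (p1:User {name: "Cynthia Freeman"})-[:RATED]->(:Movie)<-[:RATED]-(p2:User)
RETURN COUNT(DISTINCT p2);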

In another example, they pre-compute this [:SIMILARITY] relationship between user nodes and store it in the graph, so querying the top-N most similar users becomes a matter of ordering existing relationships. Intuitively, though, this makes the graph dense, so there is no storage advantage over simply keeping a similarity matrix.
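A minimal sketch of what that lookup could look like (the score property name is my assumption, not taken from their example):

MATCH (p1:User {name: "Cynthia Freeman"})-[s:SIMILARITY]-(p2:User)
RETURN p2.name, s.score AS sim
ORDER BY sim DESC
LIMIT 100;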

Am I missing something fundamental about the way graph databases (and neo4j in particular) work? How can this scale to real-time applications, where there can be tens of thousands of users, and even more products that they interact with?


2 Answers


If you want to do real-time recommendations using some sort of cosine distance metric on tens of thousands of nodes or more, it is probably best to store the precomputed values as relationships.

As for making the graph dense: you can limit the SIMILAR relationships to the top K most similar nodes and also define a similarity cutoff threshold, which lets you make the graph as sparse as you like, storing only the relevant results. For example, in a graph of 10 thousand nodes, if every node keeps a relationship only to its top 10 most similar nodes, the result is not a dense graph at all. If you additionally deduplicate pairs of relationships that point from one node to another and back, you can reduce the count even further. So instead of the 10k * 10k = 100 million possible relationships (half that if you treat them as undirected), you end up with at most about 100k.

The Graph Data Science library supports two algorithms for calculating cosine distance:

The first is a naive version that calculates the distance between all pairs and can be tuned with the topK and similarityCutoff parameters.

Just recently, an optimized implementation of the kNN algorithm was added in the GDS 1.4 pre-release. It uses the approach described in this article: https://dl.acm.org/doi/abs/10.1145/1963405.1963487
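As an illustration, a pre-computation step with the naive procedure could look roughly like this. The exact procedure names and parameters vary across GDS versions (this sketch assumes the 1.x alpha API and the rating property on RATED from the question), so check the documentation for your release:

// Build one weight vector per user over all movies, using NaN for unrated ones,
// then write a SIMILAR relationship only for the top 10 neighbours above the cutoff.
MATCH (u:User), (m:Movie)
OPTIONAL MATCH (u)-[r:RATED]->(m)
WITH {item: id(u), weights: COLLECT(COALESCE(r.rating, gds.util.NaN()))} AS userData
WITH COLLECT(userData) AS data
CALL gds.alpha.similarity.cosine.write({
    data: data,
    skipValue: gds.util.NaN(),
    topK: 10,
    similarityCutoff: 0.1,
    writeRelationshipType: 'SIMILAR',
    writeProperty: 'score'
})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs;

With topK: 10 and a cutoff, every user ends up with at most 10 outgoing SIMILAR relationships, which is exactly the sparsity argument above.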

However, for a real-time calculation of similarity across 10k+ nodes, it might still take more than the ~100 ms you would want for a real-time response, so going with the pre-computed similarity relationships makes sense.


Aside from @TomažBratanič's great suggestions, your existing query can be made more efficient. It is performing mathematical calculations for every p1/p2 pair, even for pairs that are later filtered out because the number of shared movies does not exceed 10. Instead, you should try filtering out unwanted p1/p2 pairs before you do the calculations.

For example:

MATCH
    (p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
WITH
    COLLECT({xr: x.rating, yr: y.rating}) AS data,
    p1, p2
WHERE
    SIZE(data) > 10
WITH
    REDUCE(s = 0.0, d IN data | s + d.xr * d.yr) AS xyDotProduct,
    SQRT(REDUCE(xDot = 0.0, a IN data | xDot + a.xr^2)) AS xLength,
    SQRT(REDUCE(yDot = 0.0, b IN data | yDot + b.yr^2)) AS yLength,
    p1, p2
RETURN 
    p1.name, p2.name, xyDotProduct / (xLength * yLength) AS sim
ORDER BY sim DESC
LIMIT 100;