Optimize Cypher Query for Movie Recommendation on Large Dataset

Question

I'm currently working on movie recommendation using MovieLens 20m dataset after reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/. Node Movie connects to Genre with relationship hasGenre, Node Movie connects to User with relationship hasRating. I'm trying to retrieve all movies which are most highly co-rated (co-rated > 3.0) with a query (e.g. Toy Story) that share all genres with Toy Story. Here's my Cypher query:

MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre) 
WITH  inputGenres,  r, o, movie, COLLECT(genre) AS genres 
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)  
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC

However, it seems that my system cannot handle it (using 16GB of RAM, Core i7 4th gen, and SSD). When I'm running the query, it peaks to 97% of RAM then Neo4j shutdowns unexpectedly (probably due to heap size or else, due to RAM size).

Do I make the query correct? I'm newbie in Neo4j so probably I make the query incorrectly.
Please suggest how to optimize such query?
How can I optimize the Neo4j so it can handle large dataset with my system's spec according to the query?

Thanks in advance.

Tezra Tezra · Accepted Answer · 2018-10-30T14:45:16

First, your Cypher can be simplified for more efficient planning by only matching what we need, and handling the rest in the WHERE (so that filtering can possibly be done while matching)

MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)
WHERE (r.rating>3 and o.rating>3) AND ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC

Now, if you don't mind adding data to the graph to find the data you want, another thing you can do is split the query into tiny bits and "cache" the result. So for example

// Cypher 1
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (movie:Movie)
WHERE ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
// Merge so that multiple runs don't create extra copies
MERGE (inputMovie)-[:isLike]->(movie)

// Cypher 2
MATCH (movie:Movie)<-[r:hasRating]-(user)
WHERE r.rating>3
// Merge so that multiple runs don't create extra copies
MERGE (user)-[:reallyLikes]->(movie)

// Cypher 3
MATCH (inputMovie:Movie{movieId: 1})<-[:reallyLikes]-(user)-[:reallyLikes]->(movie:Movie)<-[:isLike]-(inputMovie)
RETURN movie.title,movie.movieId, count(*) 
ORDER BY count(*) DESC

Optimize Cypher Query for Movie Recommendation on Large Dataset

1 Answers