I'm currently working on movie recommendation using MovieLens 20m dataset after reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/. Node Movie connects to Genre with relationship hasGenre, Node Movie connects to User with relationship hasRating. I'm trying to retrieve all movies which are most highly co-rated (co-rated > 3.0) with a query (e.g. Toy Story) that share all genres with Toy Story. Here's my Cypher query:
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre)
WITH inputGenres, r, o, movie, COLLECT(genre) AS genres
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
However, it seems that my system cannot handle it (using 16GB of RAM, Core i7 4th gen, and SSD). When I'm running the query, it peaks to 97% of RAM then Neo4j shutdowns unexpectedly (probably due to heap size or else, due to RAM size).
- Do I make the query correct? I'm newbie in Neo4j so probably I make the query incorrectly.
- Please suggest how to optimize such query?
- How can I optimize the Neo4j so it can handle large dataset with my system's spec according to the query?
Thanks in advance.