Neo4j crashes on 4th degree Cypher query

Question

neo4j noob here, on Neo4j 2.0.0 Community

I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.

Let's say that I've got a specific movie (movie1) in mind.

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;

I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies that as the people who liked movie1. I call these generation 2 users

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;

This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;

This query causes neo4j to spin for several minutes at 100% cpu utilization and using 4GB of RAM. Then it sends back an exception "OutOfMemoryError: GC overhead limit exceeded".

I was hoping someone could help me out and explain to me the issue. Is Neo4j not meant to handle a query of this depth in a performant manner? Is there something wrong with my Cypher query?

Thanks for taking the time to read.

sorry, cannot share it. It's not really a movie database anyway. — HaterTot

cod3monk3y cod3monk3y · Accepted Answer · 2013-12-13T02:37:21

That's a pretty intense query, and the deeper you go the closer you're probably getting to a set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form starting with your given movie. @Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.

The branching factor of your tree can be estimated with two values:

u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)

For an estimate, your first step will return m users. On the next step, for each user you get all the movies each of them liked followed by all the users that liked all of those movies:

gen(1) => u
gen(2) => u * (m * u)

For each generation you'll tack on another m*u, so your third generation is:

gen(3) => u * (m * u) * (m * u)

Or more generically:

gen(n) => u^n * m^(n-1)

You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.

To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.

To keep things simple (and assuming your data set is much smaller), let's use:

m = 20 movie ratings/user 
u = 100 users have rated/movie

Your query in gen-3 (without distincts) will return:

gen(3) = 100^3 * 20^2
       = 400,000,000

400 million nodes (users)

Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).

Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes, your storage requirement for this resultant set is:

400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 kb
= 22,888 Mb
= 22.35 Gb

So by my conservative estimates, your query requires 22 Gigabytes of storage. This seems quite reasonable that Neo4j would run out of memory.

My guess is that you're trying to find similarities in the patterns of users, but the query you're using is returning all the users in your dataset duplicated a bunch of times. Maybe you want to be asking questions of your data more like:

what users rate movies most like me?
what users rated most of the same movies as I rated
what movies have users that have rated similar movies to me watched that I haven't watched yet?

Cheers, cm

Neo4j crashes on 4th degree Cypher query

3 Answers