Neo4j: performance issue finding all paths between two nodes with CYPHER

Question

I'm using TheMovieDB database downloaded here. It has ~60k nodes and ~100k relationships and I need to find all the paths of a given length k between two nodes a and b with a given name property. Let's say I need to find all the path of lenght 2 between Keanu Reeves and Laurence Fishburne. I used the following CYPHER query:

MATCH (k)-[e*2..2]-(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l

and it took 40 seconds.

I decided to try a different approach and used the following query instead:

MATCH (k)--(m)--(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,m,l

and it took 252 milliseconds!

Those two queries gave the same results, had the same meaning and yet the first one took 200x more time. How is that possibile?

I need to conduct some tests in which I have to find all the paths with a given maximum (but not a minimum) length between two given nodes. This gives me some problems because I cannot use the second approach I described (it works only with a fixed lenght path) and the first one is waaaay too slow.

I also cannot use allShortestPath because it doesn't return any path whose length is greater than the shorter one.

It's driving me crazy... Any idea how to solve it?

Edit

Another example of how big this issue is: finding a path of lenght 4 between Robert Downey Jr. and Harrison Ford. Method #2: ~500 milliseconds Method #1: >360 seconds (after those 6 minutes I brutally unplugged the pc power adaptor)

Dave Bennett Dave Bennett · Accepted Answer · 2015-04-24T18:13:37

The reason your first query is taking so long is because it is not using any indexes at all; you are scanning the entire database.

If you change your query slightly to include the Actor label in the path you are matching you will significantly improve the query performance.

MATCH (k)-[e*2..2]-(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l

If you reveal the indexes by executing the :schema command in the browser you will see the indexes that are in place. You can see that the first one is on :Actor(name); withing the Actor label the name property is indexed.

Indexes
  ON :Actor(name)    ONLINE                             
  ON :Director(name) ONLINE                             
  ON :Movie(title)   ONLINE                             
  ON :Person(name)   ONLINE                             
  ON :User(login)    ONLINE (for uniqueness constraint) 

Constraints
  ON (user:User) ASSERT user.login IS UNIQUE

If you profile your query

profile
MATCH (k)-[e*2..2]-(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l

and then profile the one with the :Actor label added it will be abundantly clear why the two perform differently.

profile
MATCH (k:Actor)-[e*2..2]-(l:Actor)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l

I forgot to add that you should also profile your second ( faster ) query:

profile
MATCH (k)--(m)--(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,m,l

You will see that the query plans are significantly different. I think simply adding an asterisk to a relationship probably sends the database engine down a different optimization path.

Good luck!

Neo4j: performance issue finding all paths between two nodes with CYPHER

1 Answers