5
votes

I am using Neo4j graph database version 2.1.7. Brief details about the data: 2 million nodes of 6 different types and 5 million relationships of only 5 different types. The graph is mostly connected but contains a few isolated subgraphs.

While resolving paths, I get cycles in the paths. To prevent that, I used the solution shared in: Returning only simple paths in Neo4j Cypher query

Here is the query I am using:

MATCH (n:nodeA{key:905728}) 
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA) 
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a))) 
and (length(EXTRACT (p in NODES(path)| p.key)) > 1) 
and ((exists ((c)-[:rel5]->(b)) and (not exists((b)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (b)-[]->(x))))
    OR (not exists ((c)-[:rel5]->()) and (not exists ((c)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (c)-[]->(x))))) 
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);

The above query meets my requirement but is not cost-effective, and it keeps running when executed against a huge subgraph. I have used the 'profile' command to improve query performance from where I started, but I am now stuck at this point. The performance has improved, but not to the level I expected from Neo4j :(

2 Answers

2
votes

I don't know that I have a solution, but I have a number of suggestions. Some might speed things up, some might just make the query easier to read.

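Firstly, rather than putting exists ((c)-[:rel5]->(b)) in your WHERE, I believe you can put it in your MATCH like this (note that this makes the rel5 relationship mandatory, whereas in your WHERE it is only one branch of an OR, so check that the results stay the same):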

MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA), (c)-[:rel5]->(b)

I don't think you need the exists keyword. I think you can just say, for example, (NOT (b)-[:rel1|rel2|rel3|rel4]->(:nodeA))
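For instance, a minimal sketch of that form. The surrounding MATCH, RETURN and LIMIT are only there to make it a complete query and are my assumptions, not part of your original:

```cypher
// Pattern predicate used directly in WHERE, without the exists() wrapper.
// Keeps only the (c)-[:rel5]->(b) pairs where b has no outgoing
// rel1..rel4 relationship to another nodeA.
MATCH (c:nodeA)-[:rel5]->(b:nodeA)
WHERE NOT (b)-[:rel1|rel2|rel3|rel4]->(:nodeA)
RETURN c.key, b.key
LIMIT 10;
```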

I'd also suggest thinking about the WITH clause for potential performance improvements: it lets you filter or aggregate intermediate results before the next part of the query runs.
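A rough sketch of how WITH might be used here, assuming it is acceptable to drop cyclic paths first and only then look for the optional rel5 hop. The upper bound of 15 and the column aliases are placeholders of mine, and I have not profiled this against your data, so treat it as something to try rather than a guaranteed win:

```cypher
MATCH (n:nodeA {key: 905728})
MATCH path = (n)-[:rel1|rel2|rel3|rel4*1..15]->(c:nodeA)
// Discard paths that visit a node twice before doing any more matching.
WITH path, c
WHERE ALL(a IN nodes(path)
          WHERE 1 = length(filter(m IN nodes(path) WHERE m = a)))
// Only the surviving simple paths are expanded by the rel5 hop.
OPTIONAL MATCH (c)-[:rel5]->(b:nodeA)
RETURN DISTINCT extract(p IN nodes(path) | p.key) AS pathKeys,
                b.key AS rel5TargetKey;
```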

A couple of notes about your variable-length paths: in *0.. the 0 means that a zero-length path is allowed, so c can be the same node as n. That may or may not be what you want. Also, leaving the variable-length path open-ended can often cause performance problems (as I think you're seeing). If you can possibly cap it, that may help.
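One way to pick a cap, sketched below, is to count how many paths exist at each depth with a deliberately small upper bound and raise it gradually. The bound of 10 and the aliases are arbitrary assumptions for illustration:

```cypher
// Starts at 1 hop (so n itself is not returned as c) and stops at 10 hops.
MATCH (n:nodeA {key: 905728})
MATCH path = (n)-[:rel1|rel2|rel3|rel4*1..10]->(c:nodeA)
RETURN length(path) AS hops, count(*) AS paths
ORDER BY hops;
```

If the path count grows sharply from one depth to the next, that is usually where the uncapped query starts to blow up.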

Also, if you upgrade to 2.2.1, there are a number of built-in performance improvements in the 2.2.x line, and you also get visual PROFILE output in the console and a new EXPLAIN command. EXPLAIN shows the execution plan without running the query, while PROFILE runs the query and reports the actual rows and db hits for each step.
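On 2.2.x both are used as a prefix on the query itself. A small sketch, with a simplified, capped version of your pattern that I made up for illustration:

```cypher
// EXPLAIN: shows the plan only; the query is not executed.
EXPLAIN
MATCH (n:nodeA {key: 905728})-[:rel1|rel2|rel3|rel4*1..15]->(c:nodeA)
RETURN count(c);

// PROFILE: executes the query and reports rows and db hits per operator.
PROFILE
MATCH (n:nodeA {key: 905728})-[:rel1|rel2|rel3|rel4*1..15]->(c:nodeA)
RETURN count(c);
```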

One more thing to consider is that I don't think you're hitting the performance boundaries of Neo4j but rather, perhaps, some boundaries of Cypher. If so, I might suggest you do your querying with the Java APIs that Neo4j provides, for better performance and more control. This can be done either by embedding your database, if you're using a JVM-compatible language, or by writing an unmanaged extension, which lets you do your own querying in Java and expose a custom REST API from the server.

1
votes

I made a couple more tweaks to my query as suggested above by Brian and found an improvement in query response time. It now takes about 20% of the execution time of my original query, and it makes almost 60% fewer db hits than the query I shared earlier. Here is the updated query:

MATCH (n:nodeA{key:905728}) 
MATCH path = n-[:rel1|rel2|rel3|rel4*1..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA) 
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a))) 
and (length(path) > 0) 
and ((exists ((c)-[:rel5]->(b)) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x))))
    OR (not exists ((c)-[:rel5]->()) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x))))) 
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);

I also observed a dramatic improvement when I capped the path, changing *1.. to *1..15. I also removed one filter from the query, which was taking a long time as well. However, the query response time still increased when querying nodes whose relationships go more than 18-20 levels deep.

I would advise using the profile command often to find the pain points in your query. That will help you resolve issues faster. Thanks, Brian.