How to improve performance on variable length Neo4j Cypher query?

Question

I'm querying Neo4j from a Java Spring Boot application, using neo4j-java-driver to connect over the Bolt port, but my query takes approximately 30 minutes to return results.

The query:

MATCH path=(:JAVA {snapshot: 3})-[*]->()
UNWIND nodes(path) as n
WITH DISTINCT n
SET n.scope = 'JAVA'
RETURN n.ID

I've tried searching online for optimization techniques as well as APOC functions but nothing I've attempted so far is improving the performance. The labels are indexed. Snapshot is a property that is present on all nodes and ID is a separate identification that is needed for unrelated reasons.

Graph Information

200K nodes
355K Relationships
9073 nodes of type JAVA
61K direct relationships outgoing from nodes of type JAVA
dbms.memory.heap.initial_size=3G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=1G

I'm essentially trying to traverse a program call chain where the start of the chain is a node of type JAVA. If any other node is reachable from a node of type JAVA then I want to set its scope and return its ID. What I think is happening is that the graph is pretty dense with common path traversals and the query is traversing the same path more than once. I'm not sure I can prevent this or if Neo4j handles that issue internally.

From Java, I execute the query through the driver (which is instantiated when the application starts) and collect the IDs from the results.

try (final Session session = getDriver().session()) {
    session.run(new Statement("<The query>")).stream()
        .map(record -> Long.valueOf(record.get(0).asLong()))
        .collect(Collectors.toList());
...

EDIT, follow up to questions in comments with more data. Distinct dependencies of nodes with JAVA label.

MATCH (:JAVA {snapshot: 3})-[*]->(n) RETURN count(DISTINCT n)

returns 182,749

Profile of query plan

InverseFalcon · Accepted Answer · 2019-08-14T02:23:12

We can certainly test that analysis.

Keep in mind that UNWINDing the path nodes is inefficient here. It produces a large number of repeated rows even if the end nodes of all the paths are distinct, because every node on a subpath also appears in each longer path that extends that subpath.
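To make the repetition concrete, here is a minimal Java sketch (plain Java, not Neo4j code) that enumerates every path from a start node in a small made-up graph and counts the rows that UNWIND nodes(path) would produce. The class name UnwindRepeats, the toy graph, and the countRows helper are all assumptions for illustration. The sketch also assumes the graph has no cycles, whereas Cypher itself stops a single path from reusing a relationship.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of MATCH path=(start)-[*]->() followed by UNWIND nodes(path). */
class UnwindRepeats {

    // Hypothetical acyclic toy graph: 0 -> {1,2} -> 3 -> {4,5} -> 6.
    static final Map<Integer, List<Integer>> GRAPH = Map.of(
        0, List.of(1, 2),
        1, List.of(3),
        2, List.of(3),
        3, List.of(4, 5),
        4, List.of(6),
        5, List.of(6),
        6, List.of());

    /**
     * Enumerates all paths of length >= 1 from start (graph must be acyclic)
     * and returns {pathCount, rowsAfterUnwind, distinctNodeCount}.
     */
    static long[] countRows(Map<Integer, List<Integer>> graph, int start) {
        long paths = 0;
        long unwindRows = 0;
        Set<Integer> distinct = new HashSet<>();
        // Each stack entry is one partial path, stored as the list of its nodes.
        Deque<List<Integer>> stack = new ArrayDeque<>();
        stack.push(List.of(start));
        while (!stack.isEmpty()) {
            List<Integer> path = stack.pop();
            int last = path.get(path.size() - 1);
            for (int next : graph.getOrDefault(last, List.of())) {
                List<Integer> longer = new ArrayList<>(path);
                longer.add(next);
                paths++;                      // one matched path
                unwindRows += longer.size();  // UNWIND emits one row per node on it
                distinct.addAll(longer);
                stack.push(longer);           // keep extending this path
            }
        }
        return new long[] {paths, unwindRows, distinct.size()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countRows(GRAPH, 0)));
    }
}
```

On this 7-node toy graph the sketch counts 12 paths and 46 unwound rows for only 7 distinct nodes, and the gap grows quickly as more branching layers are added.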

A better version of your query would be:

MATCH path=(:JAVA {snapshot: 3})-[*]->(n)
WITH DISTINCT n
SET n.scope = 'JAVA'
RETURN n.ID

But if there are multiple paths to the same node (the PROFILE plan of that query would show a large gap between the row counts before and after the DISTINCT operation), then this is a good case for APOC path expanders. We can configure them to use a traversal uniqueness behavior that visits each distinct node only once throughout all expansions.

If your query is getting hung up because it revisits the same nodes and paths over and over, then this should help.

Try this:

MATCH (start:JAVA {snapshot: 3})
CALL apoc.path.subgraphNodes(start, {relationshipFilter:'>'}) YIELD node as n
WITH start, n
WHERE n <> start // exclude each expansion's own start node
WITH DISTINCT n // a node may be reachable from several start nodes
SET n.scope = 'JAVA'
RETURN n.ID
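The idea behind a node-unique traversal can be sketched in plain Java. This is not APOC's implementation, only a breadth-first search over a hypothetical adjacency map that queues each node at most once, so the work is bounded by the number of nodes and relationships rather than by the number of paths. The names NodeUniqueTraversal and reachable are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Conceptual sketch of a traversal that expands every node at most once. */
class NodeUniqueTraversal {

    /**
     * Returns every node that can be reached from any start node by following
     * at least one outgoing edge, in the order it was first discovered.
     * Cycles are safe because a node is never queued twice.
     */
    static Set<Integer> reachable(Map<Integer, List<Integer>> graph,
                                  Collection<Integer> starts) {
        Set<Integer> queued = new HashSet<>();   // nodes already scheduled for expansion
        Deque<Integer> queue = new ArrayDeque<>();
        for (Integer s : starts) {
            if (queued.add(s)) {
                queue.add(s);
            }
        }
        Set<Integer> reached = new LinkedHashSet<>();
        while (!queue.isEmpty()) {
            int node = queue.poll();
            for (Integer next : graph.getOrDefault(node, List.of())) {
                reached.add(next);               // reached via an outgoing edge
                if (queued.add(next)) {          // expand each node only once
                    queue.add(next);
                }
            }
        }
        return reached;
    }

    public static void main(String[] args) {
        // Hypothetical graph with a cycle: 0 -> {1,2} -> 3 -> 0.
        Map<Integer, List<Integer>> graph = Map.of(
            0, List.of(1, 2),
            1, List.of(3),
            2, List.of(3),
            3, List.of(0));
        System.out.println(reachable(graph, List.of(0))); // prints [1, 2, 3, 0]
    }
}
```

Each relationship is followed at most once here, which is why this style of traversal stays cheap on a graph where many paths share the same nodes.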
