I'm querying Neo4j from a Java Spring Boot application, using neo4j-java-driver to connect to the Bolt port, but my query takes approximately 30 minutes to return results.
The query:
MATCH path=(:JAVA {snapshot: 3})-[*]->()
UNWIND nodes(path) as n
WITH DISTINCT n
SET n.scope = 'JAVA'
RETURN n.ID
I've tried searching online for optimization techniques as well as APOC functions but nothing I've attempted so far is improving the performance. The labels are indexed. Snapshot is a property that is present on all nodes and ID is a separate identification that is needed for unrelated reasons.
Graph Information
- 200K nodes
- 355K Relationships
- 9073 nodes of type JAVA
- 61K direct relationships outgoing from nodes of type JAVA
- dbms.memory.heap.initial_size=3G
- dbms.memory.heap.max_size=4G
- dbms.memory.pagecache.size=1G
I'm essentially trying to traverse a program call chain where the start of the chain is a node of type JAVA. If any other node is reachable from a node of type JAVA then I want to set its scope and return its ID. What I think is happening is that the graph is pretty dense with common path traversals and the query is traversing the same path more than once. I'm not sure I can prevent this or if Neo4j handles that issue internally.
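To illustrate what I mean by "traversing the same path more than once", here is a minimal in-memory sketch of the behaviour I would like: a breadth-first reachability walk that expands each node at most once, no matter how many distinct paths lead to it. This is plain Java over an adjacency map, not Neo4j or driver code, and the names `ReachabilitySketch` and `reachable` are made up for illustration. As a simplification, the start nodes are always included in the result.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReachabilitySketch {

    /**
     * Returns the start nodes plus every node reachable from them by
     * following outgoing edges. Each node is expanded at most once, so the
     * work is proportional to nodes + edges rather than to the number of paths.
     */
    static Set<Long> reachable(Map<Long, List<Long>> outgoing, Collection<Long> starts) {
        Set<Long> visited = new LinkedHashSet<>(starts);
        Deque<Long> queue = new ArrayDeque<>(visited);
        while (!queue.isEmpty()) {
            Long current = queue.poll();
            for (Long next : outgoing.getOrDefault(current, Collections.emptyList())) {
                // add() returns false if the node was already seen,
                // which is what prevents re-walking shared sub-chains.
                if (visited.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical call graph: node 4 is reachable via two paths and
        // 5 -> 2 closes a cycle, yet every node is expanded only once.
        Map<Long, List<Long>> outgoing = new HashMap<>();
        outgoing.put(1L, Arrays.asList(2L, 3L));
        outgoing.put(2L, Arrays.asList(4L));
        outgoing.put(3L, Arrays.asList(4L));
        outgoing.put(4L, Arrays.asList(5L));
        outgoing.put(5L, Arrays.asList(2L));
        System.out.println(reachable(outgoing, Arrays.asList(1L)));
    }
}
```

My understanding is that the variable-length pattern in my query enumerates every distinct path instead, which is why I suspect the shared sub-chains are being walked repeatedly.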
From Java, I access the driver (which is instantiated when the application starts), execute the query, and collect the IDs from the results.
try (final Session session = getDriver().session()) {
    // Run the query and collect the returned n.ID values
    final List<Long> ids = session.run(new Statement("<The query>")).stream()
            .map(record -> Long.valueOf(record.get(0).asLong()))
            .collect(Collectors.toList());
    ...
}
EDIT: Following up on the questions in the comments with more data. Count of distinct nodes reachable from nodes with the JAVA label:
MATCH (:JAVA {snapshot: 3})-[*]->(n) RETURN count(DISTINCT n)
returns 182,749
Profile of query plan