I am in the process of evaluating Neo4j for use in a production environment and I have encountered some difficulty when doing something that I expected to be simple. I have managed to solve it, but in a suboptimal and quite complicated way so I was thinking if there might be a simpler way to accomplish the same thing.
Preamble
Neo4j version 2.3.2
TL;DR
This is a bit of a long explanation so the summary is as follows:
Given node A, I need to find all nodes in the subgraph of A with a complexity of O(number_of_vertices + number_of_edges).
Problem
For our use case we have a graph that, with respect to one particular type of relationship, is broken up into smaller disconnected subgraphs of no more than a couple of dozen nodes each. What we are trying to accomplish is, given an indexed id from one of those nodes discover all nodes in the subgraph (while treating the graph as being undirected). The other peculiarity is that our nodes always have the reflexive edge from every node back to itself.
Algorithmically all we need is a breadth first search with an expected complexity of O(num_of_vertices + num_of_edges). Since our graphs are not dense the number of edges in them is roughly linear to the number of number of vertices so the overall complexity should be linear to the number of vertices in it.
Test graph
For simplicity I have made this test graph a fully connected one. Since the point is to compare cypher queries this does not affect the results.
Commands:
- CREATE (:Label { id: 1 }),(:Label { id: 2 }),(:Label { id: 3 }),(:Label { id: 4 })
- MATCH (a),(b) MERGE (a)-[:REL]->(b)
Simple query
The first query I tried to get the desired results is the following:
- MATCH (a:Label {id:1})-[:REL*1..]-(b:Label) RETURN DISTINCT b
That query never terminated. When I added an upper bound for the relationship pattern and profiled the query I got the following:
- Depth 4: 3902 db accesses
- Depth 8: 1714982 db accesses
So instead of being linear to the number of vertices and edges this looks like it's finding all possible paths which of course explodes with depth.
Better performing query
To accomplish this I wrote the following query instead:
https://gist.github.com/Dalamar42/1ec93cd74b01c145e7bd
(This will search to depth 2. Duplicate lines 6-16 to search to depth 4, 6, 8, and so on)
The query does the following:
- Get the entry node
- Add that node to a collection of nodes_found and another of nodes_to_visit
- For every node A in nodes_to_visit follow its edges to node B
- From nodes B keep only nodes that are not in nodes_found as nodes C
- Set nodes_to_visit to nodes C and set nodes_found to its previous value plus nodes C
- Repeat to the desired depth
This query should almost have the complexity of BFS except for one complexity. From what I understand every intermediate MATCH/WHERE needs to match at least one node otherwise cypher returns empty disregarding nodes found in the previous steps. I worked around this by changing step 4 to:
"From nodes B keep only node A and nodes that are not in nodes_found as nodes C"
Because all nodes have the reflexive edge, node A will always be in the set of nodes B and by always keeping it I make sure that that part of the query will always match at least one node.
This means this query has the following problems:
- I need to define the max search depth
- Increasing depth does not come for free since I am exploring the edges of A every time
- This query is a pain for someone to read, especially given that this will actually be only part of a bigger query that has to do this a couple of times
The upside of this query is that I am getting much better performance
- For searching to depth 4: 65 db accesses (compared to 3902 in the "dumb" query)
- For searching to depth 6: 97 db accesses (compared to 1714982 in the "dumb" query)
Better solution? Does anyone know of a better/simpler solution? Am I missing some obvious feature of Cypher? I wasn't able to find anything going through the documentation.
Thanks