Given node A find all nodes in A's subgraph with linear time complexity in Neo4j

Question

I am in the process of evaluating Neo4j for use in a production environment and I have encountered some difficulty when doing something that I expected to be simple. I have managed to solve it, but in a suboptimal and quite complicated way so I was thinking if there might be a simpler way to accomplish the same thing.

Preamble

Neo4j version 2.3.2

TL;DR

This is a bit of a long explanation so the summary is as follows:

Given node A, I need to find all nodes in the subgraph of A with a complexity of O(number_of_vertices + number_of_edges).

Problem

For our use case we have a graph that, with respect to one particular type of relationship, is broken up into smaller disconnected subgraphs of no more than a couple of dozen nodes each. What we are trying to accomplish is, given an indexed id from one of those nodes discover all nodes in the subgraph (while treating the graph as being undirected). The other peculiarity is that our nodes always have the reflexive edge from every node back to itself.

Algorithmically all we need is a breadth first search with an expected complexity of O(num_of_vertices + num_of_edges). Since our graphs are not dense the number of edges in them is roughly linear to the number of number of vertices so the overall complexity should be linear to the number of vertices in it.

Test graph

For simplicity I have made this test graph a fully connected one. Since the point is to compare cypher queries this does not affect the results.

Commands:

CREATE (:Label { id: 1 }),(:Label { id: 2 }),(:Label { id: 3 }),(:Label { id: 4 })
MATCH (a),(b) MERGE (a)-[:REL]->(b)

Simple query

The first query I tried to get the desired results is the following:

MATCH (a:Label {id:1})-[:REL*1..]-(b:Label) RETURN DISTINCT b

That query never terminated. When I added an upper bound for the relationship pattern and profiled the query I got the following:

Depth 4: 3902 db accesses
Depth 8: 1714982 db accesses

So instead of being linear to the number of vertices and edges this looks like it's finding all possible paths which of course explodes with depth.

Better performing query

To accomplish this I wrote the following query instead:

https://gist.github.com/Dalamar42/1ec93cd74b01c145e7bd

(This will search to depth 2. Duplicate lines 6-16 to search to depth 4, 6, 8, and so on)

The query does the following:

Get the entry node
Add that node to a collection of nodes_found and another of nodes_to_visit
For every node A in nodes_to_visit follow its edges to node B
From nodes B keep only nodes that are not in nodes_found as nodes C
Set nodes_to_visit to nodes C and set nodes_found to its previous value plus nodes C
Repeat to the desired depth

This query should almost have the complexity of BFS except for one complexity. From what I understand every intermediate MATCH/WHERE needs to match at least one node otherwise cypher returns empty disregarding nodes found in the previous steps. I worked around this by changing step 4 to:

"From nodes B keep only node A and nodes that are not in nodes_found as nodes C"

Because all nodes have the reflexive edge, node A will always be in the set of nodes B and by always keeping it I make sure that that part of the query will always match at least one node.

This means this query has the following problems:

I need to define the max search depth
Increasing depth does not come for free since I am exploring the edges of A every time
This query is a pain for someone to read, especially given that this will actually be only part of a bigger query that has to do this a couple of times

The upside of this query is that I am getting much better performance

For searching to depth 4: 65 db accesses (compared to 3902 in the "dumb" query)
For searching to depth 6: 97 db accesses (compared to 1714982 in the "dumb" query)

Better solution? Does anyone know of a better/simpler solution? Am I missing some obvious feature of Cypher? I wasn't able to find anything going through the documentation.

Thanks

cybersam cybersam · Accepted Answer · 2016-03-12T20:29:42

[UPDATED]

Here is a simpler approach that is similar to yours, but it uses the COALESCE function to avoid having to artificially add the "entry node" into every "rim node" collection. (By "rim node", I mean a previously unencountered node that was found in the most recent match.)

The query assumes you have created an index on :Label(id) to speed up the first MATCH. The top section is just for getting the "entry node" and initializing the "res" (or result) and "rim" collections. The subsequent sections are just exact copies of each other, and can be repeated to match the desired depth of your search.

With a depth of 2, as shown here, only 40 db hits are expended.

Note1: Given your test data, only a depth of 1 is needed. In that case, only 16 DB hits are expended.

Note2: The 3rd WITH clause in each section is there to force a NULL into an empty rim collection. This is because UNWIND will abort the query if it is asked to unwind an empty collection. For some reason, passing [NULL] instead of [] to COALLESCE() does not work as expected.

MATCH (a:Label { id:1 })
USING INDEX a:Label(id)
WITH COLLECT(a) AS res
WITH res, res AS rim

UNWIND rim AS a
OPTIONAL MATCH (a)-[:REL]-(b:Label)
WHERE NOT b IN res
WITH res, COALESCE(COLLECT(DISTINCT b),[]) AS rim
WITH rim, res + rim AS res
WITH res, CASE rim WHEN [] THEN [NULL] ELSE rim END AS rim

UNWIND rim AS a
OPTIONAL MATCH (a)-[:REL]-(b:Label)
WHERE NOT b IN res
WITH res, COALESCE(COLLECT(DISTINCT b),[]) AS rim
WITH rim, res + rim AS res
WITH res, CASE rim WHEN [] THEN [NULL] ELSE rim END AS rim

RETURN res;

Given node A find all nodes in A's subgraph with linear time complexity in Neo4j

2 Answers