0
votes

Given a graph in Neo4j that is directed (but possible to have cycles), how can I retrieve all nodes that are reachable from a specific node with Cypher?

(Also: how long can I expect a query like this to take if my graph has 2 million nodes, and by extension 48 million nodes? A rough gauge will do eg. less than a minute, few minutes, an hour)

2

2 Answers

2
votes

Cypher's uniqueness behavior is that relationships must be unique per path (each relationship can only be traversed once per path), but this isn't efficient for these kinds of use cases, where the goal is instead to find distinct nodes, so a node should only be visited once total (across all paths, not per path).

There are some path expander procedures in the APOC Procedures library that are directed at these use cases.

If you're trying to find all reachable nodes from a starting node, traversing relationships in either direction, you can use apoc.path.subgraphNodes() like so, using the movies graph as an example:

MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {}) YIELD node
RETURN node

If you only wanted reachable nodes going a specific direction (let's say outgoing) then you can use a relationshipFilter to specify this. You can also add in the type too if that's important, but if we only wanted reachable via any outgoing relationship the query would look like:

MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'>'}) YIELD node
RETURN node

In either case these approaches should work better than with Cypher alone, especially in any moderately connected graph, as there will only ever be a single path considered for every reachable node (alternate paths to an already visited node will be pruned, cutting down on the possible paths to explore during traversal, which is efficient as we don't care about these alternate paths for this use case).

0
votes

Have a look here, where an algorithm is used for community detection.

You can use something like

match (n:Movie {title:"The Matrix"})-[r*1..50]-(m) return distinct id(m)

but that is slow (tested on the Neo4j movie dataset with 60k nodes, above already runs more than 10 minutes. Probably memory usage will become an issue when you have a dataset consisting out of millions of nodes. Next to that, it also depends how your dataset is connected, e.g. nr of relationships.