0
votes

I have a graph with millions of nodes and relations. I need to get all paths between two nodes. In my case all of nodes in relation paths must be same label

my query like this;

match (n:Label) match (m:Label) where n.NAME='foo' and m.NAME='foo2'
match p=(n)-[r*..20]-(m) where all(x in nodes(p) where (x:Label))
with p
return p, EXTRACT(x IN nodes(p) | x.NAME), EXTRACT(r IN relationships(p) | type(r))

Node count with label "Label" is about 20 but this query traverse in all graph to find all possible paths between two nodes and then trying reduce paths with my "where all" clause. It crashes my db then.

I need to get all nodes with label name "Label" and their relations, then query paths between subgraph to reduce cost.

2
This is a computationally difficult query. However, a couple of things to try: 1. if possible, restrict the relationship type, i.e. put r:REL_TYPE1|REL_TYPE2 (listing the possible relationship types) to the MATCH 2. check out the APOC library's path expander.Gabor Szarnyas

2 Answers

1
votes

There are a number of Path Expander APOC procedures that should be helpful, since many allow you to specify a label filter when generating paths.

For example:

MATCH (n:Label {NAME: 'foo'})
WITH COLLECT(n) AS ns1
MATCH (m:Label {NAME: 'foo2'})
WITH ns1 + COLLECT(m) AS startNodes
CALL apoc.path.expandConfig(
  startNodes,
  {labelFilter: '+Label', minLevel: 1, maxLevel: 20}
) YIELD path
RETURN
  path,
  [x IN nodes(path) | x.NAME] AS names,
  [r IN relationships(path) | TYPE(r)] AS types;
0
votes

Trying to find all possible paths (even if up to length of 20) in a graph of millions of nodes is likely to cause a memory overflow.

What you can do is break it down into smaller segments. The query won't be as elegant, but it should work.

For example, if we are doing a path length of 5 at a time, a two-segment query will look like this:

 MATCH p1 = (n1:Label)-[r1*..5]-(n2:Label), p2 = (n2:Label)-[r2*..5]-(n3:Label) 
 WHERE all(x1 in nodes(p1) WHERE (x1:Label)) 
 AND all(x2 in nodes(p2) WHERE (x2:Label)) 
 RETURN r1, r2

The cost plan for this query looks like so:

 +-----------------------+----------------------+---------------------+
 | Operator              | Variables            | Other               |
 +-----------------------+----------------------+---------------------+
 | +ProduceResults       | r1                   | r1                  |
 | |                     +----------------------+---------------------+
 | +Filter               | n1, n2, n3, r1, r2   | [see below]         |
 | |                     +----------------------+---------------------+
 | +VarLengthExpand(All) | n1, r1 -- n2, n3, r2 | (n2)-[r1:*..5]-(n1) |
 | |                     +----------------------+---------------------+
 | +Filter               | n2, n3, r2           | n3:Label            |
 | |                     +----------------------+---------------------+
 | +VarLengthExpand(All) | n3, r2 -- n2         | (n2)-[r2:*..5]-(n3) |
 | |                     +----------------------+---------------------+
 | +NodeByLabelScan      | n2                   | :Label              |
 +-----------------------+----------------------+---------------------+

So you can see that straight after the first expand a filter will filter any paths that don't start and end with :Label and only then the second expand will happen.

So long your Neo version is 2.2 or above, p1 and p2 will not include the same relationships.

You can actually see this filtering done in the filter preceding the ProduceResults operator (second line):

all(x1 in NodesFunction(ProjectedPath(Set(r1, n1),)) where x1:Label) 
AND none(r1 in r1 where any(r2 in r2 where r1 == r2)) 
AND n1:Label

Now you should also see that the we only check for all nodes in the path on that last filter. So a path like this: (a:Label)--(b:Blah)--(c:Label) will still past the first segment and only be filtered before the result production.

So you can further optimise by checking that all segment nodes have :Label and then do a manual check for relationship similar past relationships. Only showing the second stage:

WITH n2, r1
MATCH p2 = (n2:Label)-[r2*..5]-(n3:Label)
WHERE all(x2 in nodes(p2) WHERE (x2:Label))
AND none(r1 in r1 where any(r2 in r2 where r1 == r2))

I forgot to mention, but remember that queries as such are executed in a lazy manner.