5
votes

I have a graph with 0.5 billion of nodes and edges in Neo. I want to find shortest path between 2 nodes that avoids supernodes (even if it is longer than paths having supernodes on them).

The below query works fine for smaller graphs, but never finishes for the graph of the size I am dealing with:

MATCH (n:Node { id:'123'}),(m:Node { id:'234' }), p = shortestPath((n)-[*..6]-(m)) 
WHERE NONE(x IN NODES(p) WHERE size((x)--())>1000)
RETURN p

If I remove the WHERE clause it is super fast. Typically subsecond.

How can I speed it up? Would precalculating node degrees and indexing them help? Should I resort to duplicating all the edges apart from the ones adjacent to supernodes, giving them a new label and using them for my shortestPath query without the WHERE clause? Any other suggestions?

2
What happens if you remove the label? WHERE NONE(x IN NODES(p) WHERE size((x)--()) > 1000)Nicole White
Good point. Apologies, I was actually testing it with no labels in the WHERE clause. With the first label it errors. The second label doesn't seem to make any difference. Let me update my question to remove the labels. For reference it originally looked like this: WHERE NONE(x:Node IN NODES(p) WHERE size((x:Node)--())>1000)Tom

2 Answers

2
votes

As far as I can tell the Neo4j shortest path implementation prunes paths when the WHERE ALL contains relationships only (not nodes). Where it cannot prune the queries it finds all the paths then filters them (slow).

As Martin says you can add a label:

MATCH (x:Node)
WHERE size((x)--())>1000
SET n:Supernode

And then interrogate the nodes' label via the edges:

MATCH p = shortestPath((n:Node { id:'1'})-[*..6]-(m:Node { id:'2' })) 
WHERE ALL( rel IN relationships(p) WHERE not (startNode(rel):Supernode or endNode(rel):Supernode))
RETURN p

This will allow Neo4j to use the optimised, bi-directional, breadth-first (fast) query.

Some more reading here: https://neo4j.com/docs/developer-manual/current/cypher/execution-plans/shortestpath-planning/

2
votes

You could also try to add a label for supernodes:

MATCH (x:Node)
WHERE size((x)--())>1000
SET n:Supernode

Does this run and finish on your data? How many supernodes and normal nodes do you have?

Then try:

MATCH (n:Node { id:'123'}),(m:Node { id:'234' })
WITH n, m
MATCH p = shortestPath((n)-[*..6]-(m))
WHERE NONE(x IN NODES(p) WHERE (x:Supernode))
RETURN p

I suppose a label check is faster.