How to efficiently delete nodes which can be reached from another node without passing other nodes and only have 1 incoming relationship?

Question

I'm using the property graph model and Cypher in Neo4j. As described in the title, I'm trying to delete the nodes that can be reached from another node without passing through other nodes and that have only 1 incoming relationship. Here is an example of this case:

Each node has a label (the big, bold character) and a property called nodeId, which is unique among nodes. The relationship types are omitted because we cannot rely on them. The nodeId property is already indexed with a unique constraint.

Now, starting from node A {nodeId: 1}, I want to delete it and all other nodes which:

can be reached from A {nodeId: 1} without passing through another A-label node, and
have only 1 incoming relationship.

So, the nodes to be deleted are: A {nodeId: 1}, B {nodeId: 3}, C {nodeId: 4}, and C {nodeId: 8}.

Below is my Cypher code:

MATCH p = (s:A {nodeId: 1 }) -[*1..10]-> (e)
WHERE NONE (x in NODES(p) WHERE x:A AND NOT x.nodeId = 1)
WITH s, e
MATCH (e) <-[r]- ()
WITH count(r) AS num_r, s, e
WHERE num_r < 2
DETACH DELETE e
DETACH DELETE s

The code works fine, but as my graph grows it becomes slower and slower. In the beginning, it took less than 10 ms. Now that I have around 1 million nodes and 2 million relationships, it takes more than 1 second.

What should I do to improve the performance of that code?

Thank you for your help.

If C:4 has a second incoming relationship, does C:8 still get deleted? Would it make a difference in what we delete if that relationship was also coming from the primary node we are deleting? (vs any other node) — Tezra
Ah, in this problem, only nodes at the end of the path (C:8, D:5, B:6, B:7) can have more than one incoming relationship. I don't understand your second question. Could you give me an example? — Triet Doan
Add an edge from A:1 or B:3 to C:8. Then C:8 has more than 1 parent, but both parents will be deleted if you delete A:1. — Tezra
In the meantime, it sounds like you want a type of query that Cypher is very weak at. I would recommend looking at the Neo4j traversal framework. It's far more efficient at non-trivial queries like this one. — Tezra

Tezra Tezra · Accepted Answer · 2018-05-22T20:53:47

Since you only care whether a path exists, you should use shortestPath instead of a plain (a)-[*]->(b). That way, Cypher only needs to find 1 valid path between each pair of nodes instead of all possible paths. (This can be a lifesaver on larger data sets.) Also, you can use TAIL to cut the first item off the node list, so that Cypher can skip the nodeId check on the start node.

Depending on your Neo4j version, using MATCH <path> WHERE <stuff> WITH DISTINCT startnode, endnode may be more effective, because later Cypher planners can use the WITH DISTINCT as a hint to do faster, less exhaustive path matching. On earlier versions, this will hang Neo4j, and you will need to use the APOC library for Neo4j instead.

MATCH (s:A {nodeId: 1 })
WITH s MATCH p=shortestPath((s)-[*1..10]->(e))
WHERE NONE (x in TAIL(NODES(p)) WHERE x:A) AND NOT ()-->(e)<--()
WITH DISTINCT s, e
DETACH DELETE e
DETACH DELETE s

You can also change NOT ()-->(e)<--() to SIZE(()-->(e)) < 2 if you need to change that number. The former may perform better in some Cypher planners, though. If e can have 2 or more incoming relationships but still needs to be deleted, you may need to change the condition to "all parents of e are contained in the path".
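As a rough sketch of that substitution, the query could look like the following. It assumes a Neo4j version in which size() still accepts a pattern expression. On Neo4j 5 and later, where that form was removed, a COUNT { (e)<--() } < 2 subquery should be the equivalent. The threshold 2 is just the value from the question, and this variant has not been profiled against the asker's data.

```cypher
// Anchor on the start node through the unique nodeId constraint.
MATCH (s:A {nodeId: 1})
WITH s
// One shortest path per (s, e) pair is enough to prove reachability.
MATCH p = shortestPath((s)-[*1..10]->(e))
// tail() drops s itself, so no other :A node may appear on the path.
WHERE NONE(x IN tail(nodes(p)) WHERE x:A)
  // Explicit incoming-degree threshold; change 2 to allow more parents.
  AND size((e)<--()) < 2
WITH DISTINCT s, e
DETACH DELETE e
DETACH DELETE s
```

Like the original query, this deletes s only when at least one matching e exists. If s should go even when it has no deletable descendants, it needs a separate DETACH DELETE.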

If your logic gets more complicated than that (where what nodes get deleted can change what other nodes can be deleted
