0
votes

Now I have a graph with millions of nodes and millions of edge relationships. There is a directed relationship between nodes.

Now suppose the node has two states A and B. I want to find all state A nodes on the path that do not have state B.

As shown in the figure below, there are nodes A--K, and then three of them, E, G and J, are of type B, and the others are of type A. picture link is https://i.stack.imgur.com/a0yOV.jpg

For node E, its upstream and downstream traversal is shown below, so nodes B, H, K do not meet the requirements.

For node G, its upstream and downstream traversal is shown below, so nodes B, D, K do not meet the requirements.

For node J, its upstream and downstream traversal is shown below, so nodes A, B, C, D, F do not meet the requirements.

So finally only node "I" is the node that meets the requirements. picture link is https://i.stack.imgur.com/A2eqv.jpg

The case of the above example is a DAG, but the actual situation is that there may be cycle in the graph, including spin cycle (case 1), AB cycle (case 2), large loops (case 3), and complex cycle (case 4) picture link is https://i.stack.imgur.com/NDpED.jpg The Cypher query statement I can write

MATCH (n:A) 
WHERE NOT exists((n)-[*]->(:B)) 
  AND NOT exists((n)<-[*]-(:B))
RETURN n;

But this query statement is stuck in the case of millions of nodes and millions of edges with a limit 35,But in the end there are more than 30,000 nodes that meet the requirements.

Obviously my statement is taking up too much memory, querying out 30+ nodes has taken up almost all the available memory, how can I write a more efficient query?

Here is a example

CREATE (a:A{id:'a'}) 
CREATE (b:A{id:'b'})
CREATE (c:A{id:'c'}) 
CREATE (d:A{id:'d'})
CREATE (e:B{id:'e'}) 
CREATE (f:A{id:'f'})
CREATE (g:B{id:'g'}) 
CREATE (h:A{id:'h'})
CREATE (i:A{id:'i'}) 
CREATE (j:B{id:'j'})
CREATE (k:A{id:'k'})
MERGE (a)-[:REF]->(c)  
MERGE (b)-[:REF]->(c)  
MERGE (b)-[:REF]->(d)  
MERGE (b)-[:REF]->(e)
MERGE (c)-[:REF]->(f)  
MERGE (d)-[:REF]->(g)  
MERGE (e)-[:REF]->(g)  
MERGE (e)-[:REF]->(h)
MERGE (f)-[:REF]->(i)  
MERGE (f)-[:REF]->(j)  
MERGE (f)-[:REF]->(k)  
MERGE (g)-[:REF]->(k)
MERGE (g)-[:REF]->(j)

use this code will get the result 'i'

MATCH (n:A) 
WHERE NOT exists((n)-[*]->(:B)) 
  AND NOT exists((n)<-[*]-(:B))
RETURN n;

But when there are 800,000 nodes (400,000 type A, 400,000 type B) and over 1.4 million edges in the graph, this code cannot run the result

1

1 Answers

0
votes

Some thoughts:

  • I don’t think this global graph search can be solved with a single query. You will need some kind of process to optimise exploration and use the result up to a certain point in subsequent steps.
  • when you could assign node labels instead of properties to reflect the state of a node, you could use apoc.path.expandConfig to just explore paths until you hit a node with state B.
  • you don’t need to re-investigate state A nodes that you traverse before you hit a node with state B, because they will not meet the requirements.

Another approach could be this, given the fact that all nodes that are on the up or downstream paths from a B node, will not fulfil the requirements. Still assuming that you use labels to distinguish A and B nodes.

MATCH (b:B)
CALL apoc.path.spanningTree(b,
                           {relationshipFilter: "<",
                            labelFilter:"/B"
                           }
                           ) YIELD path
UNWIND nodes(path) AS downStreamNode
WITH b,COLLECT(DISTINCT downStreamNode) AS downStreamNodes
CALL apoc.path.spanningTree(b,
                           {relationshipFilter: ">",
                            labelFilter:"/B"}
                           ) YIELD path
UNWIND nodes(path) AS upStreamNode
WITH b,downStreamNodes+COLLECT(DISTINCT upStreamNode) AS upAndDownStreamNodes
RETURN apoc.coll.toSet(apoc.coll.flatten(COLLECT(upAndDownStreamNodes))) AS allNodesThatDoNotFulfillRequirements