0
votes

So, i've created a Neo4j graph database out of a relational database. The graph database has about 7 million nodes, and about 9 million relationships between the nodes.

I now want to find all nodes, that are not connected to nodes with a certain label (let's call them unconnected nodes). For example, i have nodes with the labels "Customer" and "Order" (let's call them top-level-nodes). I want to find all nodes that have no relationship from or to these top-level-nodes. The relationship doesn't have to be direct, the nodes can be connected via other nodes to the top-level-nodes.

I have a cypher query which would solve this problem:

MATCH (a) WHERE not ((a)-[*]-(:Customer)) AND not ((a)-[*]-(:Order)) RETURN a; 

As you can imagine, the query will need a long time to execute, the performance is bad. Most likely because of the undirected relationship and because it doesn't matter via how many nodes the relationship can be made. However, the relationship directions don't matter, and i need to make sure that there is no path from any node to one of the top-level-nodes.

Is there any way to find the unconnected nodes faster ? Note that the database is really big, and there are more than 2 labels which mark top-level-nodes.

2

2 Answers

0
votes

You could try this approach, which does involve more operations, but can be run in batches for better performance (see apoc.periodic.commit() in the APOC procedures library).

The idea is to first apply a label (say, :Unconnected) to all nodes in your graph (batch execute with apoc.periodic.commit), and then, taking batches of top level nodes with that label, matching to all nodes in the subgraphs extending from them and removing that label.

When you finally have run out of top level nodes with the :Unconnected label (meaning all top level nodes and their subgraphs no longer have this label) then the only nodes remaining in your graph with the :Unconnected label are not connected to your top level nodes.

Any approach to this kind of operation will likely be slow, but the advantage again is that you can process this in batches, and if you get interrupted, you can resume. Once your queries are done, all the relevant unconnected nodes are now labeled for further processing at your convenience.

Also, one last note, in Neo4j undirected relationships have no arrows in the syntax ()-[*]-().

0
votes
MATCH (a) 
WHERE 
 not (a:Customer OR a:Order)
 AND shortestPath((a)-[*]-(:Customer)) IS NULL 
 AND shortestPath((a)-[*]-(:Order)) IS NULL
RETURN a;

If you could add rel-types it would be faster. One further optimization could be to check the nodes of an :Customer path for an :Order node and vice versa. i.e.

NONE(n in nodes(path) WHERE n:Order)

In general, this might be rather a set operation, i.e. expand around all order and customer nodes in parallel into two sets and compute the overlap between the two sets.

Then remove the overlap from the total number of nodes.

I added an issue for apoc here to add such a function or procedure

https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/223