Neo4j Cypher Optimize Create relationship between each node in cartesian product

Question

I have a set of nodes, all labeled Word. I want to connect each Word to all the other words with a relationship called Distance. I do the following query:

match (word1:Word)
with word1
match (word2:Word)
where word1 <> word2
merge (word1)-[distance:DISTANCE ]->(word2)
return word1, distance, word2

It runs forever. There are only ~600 nodes and although I expect 600*600 relationships, the query shouldn't run for two hours! It is quicker in Java than in Neo4j. What advice do you have to make it quicker? I have already added an index on one of the properties and it's not improving.

That's almost 360K relationships created in one transaction, which could be a lot depending on your memory settings. — Frank Pavageau
Note you are also returning 360K rows. At the least, you can eliminate your return, as it doesn't seem like you need it. You also don't need to add a variable on the :DISTANCE relationship, if you're not going to do anything further with it in the query. If you need to batch up your merges, try the APOC library procedure apoc.periodic.commit(). — InverseFalcon

cybersam cybersam · Accepted Answer · 2016-11-14T22:50:58

Some observations:

Your query will try to perform 2*600*599 (or 718,800) MERGE operations in a single transaction. The reason for the factor of 2 is because every pair of words (say, x and y) will be seen twice (as x/y and y/x). You (presumably) only want to perform half that number of operations.
The x/y and y/x behavior also causes an attempt to ensure there are 2 DISTANCE relationships for each word pair -- one in either direction. That is (presumably) twice the number of relationships than you want (or need).
Trying to perform 720K (or even just 360K) operations in a single transaction may cause the DB server to run out of memory.

Here is a modified query that might fix the above issues. The ID(w1) < ID(w2) test makes sure the 2 words in a pair are not the same AND that the same pair is only processed once. It also uses the APOC procedure apoc.periodic.iterate to create 10K relationships at a time in separate transactions, in parallel.

CALL apoc.periodic.iterate(
  'MATCH (w1:Word), (w2:Word) WHERE ID(w1) < ID(w2) RETURN w1, w2',
  'CREATE (w1)-[:DISTANCE]->(w2)',
  {batchSize:10000, parallel:true}) YIELD batches, total
RETURN *

NOTE 1: This query assumes that you start out without any DISTANCE relationships in the DB, so it uses the cheaper CREATE clause instead of MERGE. If DISTANCE relationships exist already, then use MERGE instead (but this could create a second relationship between the same pair if the first relationship was in the opposite direction).

NOTE 2: Performing the batches in parallel should be safe because issue #2 is not possible with the new Cypher code. If 2 transactions were to attempt to create relationships in opposite directions between the same 2 nodes at the same time, that could result in a deadlock, which would cause at least one of the transactions to fail.

NOTE 3: This query assumes that the first statement (with the MATCH clause) does not itself run out of memory or take too long to process. If that assumption is wrong, then using a suitably modified query with apoc.periodic.commit should work.

Neo4j Cypher Optimize Create relationship between each node in cartesian product

1 Answers