I'm struggling to efficiently bulk update relationship properties in Neo4j. The objective is to update ~500,000 relationships (each with roughly 3 properties). I chunk them into batches of 1,000 and process each batch in a single Cypher statement,
UNWIND {rows} AS row
MATCH (s:Entity) WHERE s.uuid = row.source
MATCH (t:Entity) WHERE t.uuid = row.target
MATCH (s)-[r:CONSUMED]->(t)
SET r += row.properties
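For clarity, each element of the rows parameter is a plain dict along these lines (the uuid strings and property values here are made-up placeholders):

{
    'source': 'uuid-of-source-entity',       # matched against s.uuid
    'target': 'uuid-of-target-entity',       # matched against t.uuid
    'properties': {'a': 1, 'b': 2, 'c': 3},  # the ~3 properties to merge onto r
}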
however each batch of 1,000 relationships takes around 60 seconds. There is an index on the uuid property for the :Entity label, i.e. I've previously run,
CREATE INDEX ON :Entity(uuid)
which means that matching the relationship is very efficient according to the query plan: there are 6 total db hits and the query executes in ~150 ms. I've also added a uniqueness constraint on the uuid property, which ensures that each match returns only one element,
CREATE CONSTRAINT ON (n:Entity) ASSERT n.uuid IS UNIQUE
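If it's relevant, the index and constraint can both be confirmed from py2neo; a quick sketch using the same graph object as the code below (db.indexes() and db.constraints() are built-in procedures in Neo4j 3.0):

# Sanity check that the schema index and uniqueness constraint are in place.
graph.run("CALL db.indexes()").dump()
graph.run("CALL db.constraints()").dump()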
Does anyone know how I can further debug this to understand why it's taking Neo4j so long to process the relationships?
Note that I'm using similar logic to update the nodes themselves, which is orders of magnitude faster even though the nodes have significantly more metadata associated with them.
For reference, I'm using Neo4j 3.0.3, py2neo, and Bolt. The Python code is of the form,
for chunk in chunker(relationships):  # 1,000 relationships per chunk
    with graph.begin() as tx:
        statement = """
            UNWIND {rows} AS row
            MATCH (s:Entity) WHERE s.uuid = row.source
            MATCH (t:Entity) WHERE t.uuid = row.target
            MATCH (s)-[r:CONSUMED]->(t)
            SET r += row.properties
        """
        rows = []
        for rel in chunk:
            rows.append({
                'properties': dict(rel),
                'source': rel.start_node()['uuid'],
                'target': rel.end_node()['uuid'],
            })
        tx.run(statement, rows=rows)
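One way I'm thinking of narrowing this down is to time the run and commit steps of each batch separately; a rough sketch using the same objects as above (statement, chunker, graph), with an explicit commit instead of the with block:

import time

for chunk in chunker(relationships):  # 1,000 relationships per chunk
    rows = [{
        'properties': dict(rel),
        'source': rel.start_node()['uuid'],
        'target': rel.end_node()['uuid'],
    } for rel in chunk]
    tx = graph.begin()
    t0 = time.time()
    tx.run(statement, rows=rows)  # same UNWIND statement as above
    t1 = time.time()
    tx.commit()                   # over Bolt, much of the work may only be flushed here
    t2 = time.time()
    print('run: %.2fs commit: %.2fs' % (t1 - t0, t2 - t1))

If most of the time turns out to be spent in commit, that would at least point at the write/flush side rather than the matching.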