5 votes

I'm struggling to efficiently bulk update relationship properties in Neo4j. The goal is to update ~500,000 relationships (each with roughly 3 properties), which I chunk into batches of 1,000 and process in a single Cypher statement,

UNWIND {rows} AS row
MATCH (s:Entity) WHERE s.uuid = row.source
MATCH (t:Entity) WHERE t.uuid = row.target
MATCH (s)-[r:CONSUMED]->(t)
SET r += row.properties
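
For reference, each element of the {rows} parameter is a map carrying the source and target uuids plus the properties to merge onto the relationship. A minimal sketch of the payload shape (all uuids and property values here are hypothetical examples, not from the real data set):

```python
# Hypothetical payload for the {rows} parameter; each entry drives one
# iteration of the UNWIND above.
rows = [
    {
        'source': 'uuid-of-source-entity',  # matched against s.uuid
        'target': 'uuid-of-target-entity',  # matched against t.uuid
        'properties': {'weight': 0.5, 'unit': 'kg', 'year': 2016},
    },
]
```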

however, each batch of 1,000 relationships takes around 60 seconds. There is an index on the uuid property for the :Entity label, i.e. I've previously run,

CREATE INDEX ON :Entity(uuid)

which means that matching the relationship is super efficient per the query plan,

(query plan screenshot)

There are 6 total db hits and the query executes in ~150 ms. I've also added a uniqueness constraint on the uuid property, which ensures that each match returns at most one node,

CREATE CONSTRAINT ON (n:Entity) ASSERT n.uuid IS UNIQUE

Does anyone know how I can further debug this to understand why it's taking Neo4j so long to process the relationships?

Note that I'm using similar logic to update nodes, which is orders of magnitude faster even though the nodes have significantly more metadata associated with them.

For reference I'm using Neo4j 3.0.3, py2neo, and Bolt. The Python code block is of the form,

for chunk in chunker(relationships):  # 1,000 relationships per chunk
    with graph.begin() as tx:
        statement = """
            UNWIND {rows} AS row
            MATCH (s:Entity) WHERE s.uuid = row.source
            MATCH (t:Entity) WHERE t.uuid = row.target
            MATCH (s)-[r:CONSUMED]->(t)
            SET r += row.properties
            """

        rows = []

        for rel in chunk:
            rows.append({
                'properties': dict(rel),
                'source': rel.start_node()['uuid'],
                'target': rel.end_node()['uuid'],
            })

        tx.run(statement, rows=rows)
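
(chunker isn't shown above; a minimal sketch of how it might be implemented, assuming relationships is any iterable:)

```python
from itertools import islice

def chunker(iterable, size=1000):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))  # take the next `size` items
        if not chunk:                   # iterator exhausted
            return
        yield chunk
```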

2 Answers

2 votes

Try this query:

UNWIND {rows} AS row
WITH row.source as source, row.target as target, row
MATCH (s:Entity {uuid:source})
USING INDEX s:Entity(uuid)
WITH * WHERE true
MATCH (t:Entity {uuid:target})
USING INDEX t:Entity(uuid)
MATCH (s)-[r:CONSUMED]->(t)
SET r += row.properties;

It uses index hints to force an index lookup for both Entity nodes, and then an Expand(Into) operator, which should be more performant than the Expand(All) and Filter operators shown in your query plan.

0 votes

@william-lyon I was wondering whether I need the WITH * WHERE true clause? The reason I ask is that the number of DB hits increases from 4 to 8, i.e.

PROFILE
MATCH (s:Entity {uuid:row.source})
USING INDEX s:Entity(uuid)
MATCH (t:Entity {uuid:row.target})
USING INDEX t:Entity(uuid)
MATCH (s)-[r:CONSUMED]->(t)

returns a plan with 4 total db hits,

whereas

PROFILE
MATCH (s:Entity {uuid:row.source})
USING INDEX s:Entity(uuid)
WITH * WHERE true
MATCH (t:Entity {uuid:row.target})
USING INDEX t:Entity(uuid)
MATCH (s)-[r:CONSUMED]->(t)

returns a plan with 8 total db hits.

Note that using the index hints reduces the number of db hits from 6 down to 4. For context, we have multiple node labels (and indexes), though every node has the :Entity label.