
I have a dataset of roughly 100 GB, split into 50 files, which I want to import into an empty neo4j-3.5.5-enterprise with a 16 GB heap and an 8 GB page cache on a 32 GB RAM AWS server. The data is in JSON Lines format and is fed via Python into this query:

WITH {list} as list
UNWIND list as data
MERGE (p:LabelA {id: data.id})
    SET p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
        p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8

MERGE (j:LabelF {name: data.prop9})
MERGE (p) -[:RELATION_A {prop1: data.prop10, prop2: data.prop11}]-> (j)

MERGE (v:LabelC {name: data.prop12})
MERGE (p) -[:RELATION_B]-> (v)

FOREACH (elem IN data.prop13 |
    MERGE (a:LabelB {id: elem.id}) ON CREATE
        SET a.name = elem.name
    MERGE (a) -[:RELATION_C]-> (p)
)

FOREACH (elem IN data.prop14 |
    MERGE (u:LabelF {name: elem.name})
    MERGE (u) -[:RELATION_C]-> (p)
)

FOREACH (elem IN data.prop15 |
    MERGE (e:LabelD {name: elem})
    MERGE (p) -[:RELATION_D]-> (e)
)

FOREACH (elem IN data.prop16 |
    MERGE (out:LabelA {id: elem})
    MERGE (p) -[:RELATION_E]-> (out)
)

FOREACH (elem IN data.prop17 |
    MERGE (inc:LabelA {id: elem})
    MERGE (p) <-[:RELATION_E]- (inc)
)

FOREACH (elem IN data.prop18 |
    MERGE (pdf:LabelG {name: elem})
    MERGE (p) -[:RELATION_F]-> (pdf)
)

FOREACH (elem IN data.prop19 |
    MERGE (s:LabelE {name: elem})
    MERGE (p) -[:RELATION_G]-> (s)
)

list contains 200 JSON lines, and each query runs in its own transaction.
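The feeding loop itself looks roughly like this (a minimal sketch based on py2neo, which the graph.run calls below use; connection details are placeholders):

import json
from py2neo import Graph

BATCH_SIZE = 200  # JSON lines per query invocation, i.e. per transaction

graph = Graph('bolt://localhost:7687', auth=('neo4j', 'secret'))  # placeholders

def import_file(path, query):
    # Stream the JSONL file and run the query once per BATCH_SIZE lines;
    # each graph.run() call is auto-committed as its own transaction.
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == BATCH_SIZE:
                graph.run(query, list=batch)
                batch = []
    if batch:
        graph.run(query, list=batch)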

Indexes are set prior to data import:

self.graph.run('CREATE INDEX ON :LabelB(name)')
self.graph.run('CREATE CONSTRAINT ON (p:LabelA) ASSERT (p.id) IS NODE KEY;')
self.graph.run('CREATE CONSTRAINT ON (p:LabelB) ASSERT (p.id) IS NODE KEY;')
for label in ['LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG']:
    self.graph.run(f'CREATE CONSTRAINT ON (p:{label}) ASSERT (p.name) IS NODE KEY;')

The first few checkpoints still complete fairly quickly:

2019-05-23 15:49:10.141+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 45 checkpoint completed in 134ms
2019-05-23 16:04:45.515+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 1603 checkpoint completed in 35s 345ms
2019-05-23 16:22:38.299+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 3253 checkpoint completed in 2m 52s 483ms

But at some point the duration settles at about 20-25 minutes per checkpoint (this log is from a previous attempt):

2019-05-23 07:40:03.755 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18240 checkpoint started...
2019-05-23 07:42:15.942 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.144] version=144, last transaction in previous log=18253
2019-05-23 07:45:51.982 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=224, gcTime=240, gcCount=1}
2019-05-23 07:46:42.059 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 18279 to [/data/databases/graph.db/neostore.counts.db.a], from [/data/databases/graph.db/neostore.counts.db.b].
2019-05-23 07:53:49.108 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=158, gcTime=157, gcCount=1}
2019-05-23 08:03:11.556 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18240 checkpoint completed in 23m 7s 800ms 
2019-05-23 08:03:11.710 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 140-141, last checkpoint was made in version 143

2019-05-23 08:04:38.454 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.145] version=145, last transaction in previous log=18377
2019-05-23 08:05:57.288 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=248, gcTime=253, gcCount=1}
2019-05-23 08:11:08.331 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=143, gcTime=224, gcCount=1}
2019-05-23 08:16:37.491 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=228, gcTime=237, gcCount=1}

2019-05-23 08:18:11.732 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18471 checkpoint started...
2019-05-23 08:23:18.767 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.146] version=146, last transaction in previous log=18496
2019-05-23 08:24:55.141 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 18505 to [/data/databases/graph.db/neostore.counts.db.b], from [/data/databases/graph.db/neostore.counts.db.a].
2019-05-23 08:38:21.660 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=136, gcTime=195, gcCount=1}
2019-05-23 08:40:46.261 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18471 checkpoint completed in 22m 34s 529ms
2019-05-23 08:40:46.281 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 142-143, last checkpoint was made in version 145

Can anyone tell me what's happening here? I tried tinkering with the transaction retention and log size properties (increasing and decreasing them) to no avail, and I also ran this on a 64 GB AWS server with a 24 GB heap and a 24 GB page cache. In all scenarios the time it takes to complete a checkpoint keeps increasing. While the first two files took less than two hours to complete, I aborted the import of the third file after 6 hours because it would do work for 15 minutes (the default time between checkpoints) and then be stuck in a checkpoint for 25 minutes.
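For reference, these are the relevant neo4j.conf settings (the values shown are just examples of what I tried, plus the memory settings mentioned above):

dbms.memory.heap.max_size=16g
dbms.memory.pagecache.size=8g
dbms.tx_log.rotation.retention_policy=1 days
dbms.tx_log.rotation.size=250M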

Update 2019-05-24 17:35 UTC+2

I implemented the first part of cybersam's solution by using CREATE for relationship creation where applicable (i.e. where a relationship is represented only once in the 100 GB) and dropped the creation of RELATION_E, which I will perform at a later stage with preprocessed input files. This leads to the following import query:

WITH {list} as list
UNWIND list as data
MERGE (p:LabelA {id: data.id})
    SET p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
        p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8

MERGE (j:LabelF {name: data.prop9})
CREATE (p) -[:RELATION_A {prop1: data.prop10, prop2: data.prop11}]-> (j)

MERGE (v:LabelC {name: data.prop12})
CREATE (p) -[:RELATION_B]-> (v)

FOREACH (elem IN data.prop13 |
    MERGE (a:LabelB {id: elem.id}) ON CREATE
        SET a.name = elem.name
    CREATE (a) -[:RELATION_C]-> (p)
)

FOREACH (elem IN data.prop14 |
    MERGE (u:LabelF {name: elem.name})
    CREATE (u) -[:RELATION_C]-> (p)
)

FOREACH (elem IN data.prop15 |
    MERGE (e:LabelD {name: elem})
    CREATE (p) -[:RELATION_D]-> (e)
)

FOREACH (elem IN data.prop18 |
    MERGE (pdf:LabelG {name: elem})
    CREATE (p) -[:RELATION_F]-> (pdf)
)

FOREACH (elem IN data.prop19 |
    MERGE (s:LabelE {name: elem})
    CREATE (p) -[:RELATION_G]-> (s)
)

Then I stopped Neo4j, deleted the graph.db directory, and changed the config so that a checkpoint is attempted every 15 seconds, which makes it quick to see whether the checkpointing time still grows. The relevant setting in neo4j.conf:
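dbms.checkpoint.interval.time=15s

Then I started the data import again. Unfortunately, the checkpointing time still increases: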

2019-05-24 15:25:40.718+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 59 checkpoint completed in 240ms
2019-05-24 15:26:02.003+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 86 checkpoint completed in 1s 283ms
2019-05-24 15:26:27.518+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 105 checkpoint completed in 5s 514ms
2019-05-24 15:26:55.079+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 141 checkpoint completed in 7s 559ms
2019-05-24 15:27:23.944+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 179 checkpoint completed in 8s 864ms
2019-05-24 15:27:59.768+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 218 checkpoint completed in 15s 823ms
2019-05-24 15:28:42.819+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 269 checkpoint completed in 23s 9ms
2019-05-24 15:29:33.318+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 328 checkpoint completed in 30s 498ms
2019-05-24 15:30:32.847+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 397 checkpoint completed in 39s 489ms
2019-05-24 15:31:41.918+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 480 checkpoint completed in 49s 30ms
2019-05-24 15:33:03.113+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 576 checkpoint completed in 1m 1s 194ms 

Is there an index missing somewhere?

Update 2019-05-28 18:44 UTC+2

I created a parameter set of 100 lines and imported them using PROFILE into an empty Neo4j. The query plan looks like this:

[image: query execution plan]

Update 2019-07-14 16:11 UTC+2

I worked around the problem by preprocessing my JSONL files into duplicate-free CSV files and importing them with neo4j-admin import.
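The preprocessing collapses duplicate nodes and relationships into per-label and per-type CSV files, and the import call then looks roughly like this (a sketch with placeholder file names, showing only two labels and two relationship types):

bin/neo4j-admin import \
    --nodes:LabelA=import/label_a_header.csv,import/label_a.csv \
    --nodes:LabelF=import/label_f_header.csv,import/label_f.csv \
    --relationships:RELATION_A=import/relation_a_header.csv,import/relation_a.csv \
    --relationships:RELATION_B=import/relation_b_header.csv,import/relation_b.csv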

1 Answer


The cost of each MERGE operation for a relationship will increase linearly with the number of relationships that must be scanned. Node indexing helps to find a relationship's endpoints quickly, but Neo4j still has to scan the relationships of one of those endpoints for MERGE to know whether the desired relationship already exists.

So your query's execution time will grow as the nodes serving as relationship endpoints accumulate more and more relationships (which I surmise is happening in your case as you repeatedly execute your query).

Here is a 2-step process for fixing this.

  1. Use a query that just uses MERGE to create the nodes (without any relationships). You can continue to use MERGE in this case, since you are just dealing with indexed nodes. For example:

    UNWIND $list as data
    MERGE (p:LabelA {id: data.id})
    ON CREATE SET
      p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
      p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8
    
    MERGE (j:LabelF {name: data.name})    
    MERGE (v:LabelC {name: data.propA})
    
    FOREACH (e IN data.propB |
      MERGE (a:LabelB {id: e.id}) ON CREATE SET a.name = e.name)
    FOREACH (e IN data.propC |
      MERGE (:LabelF {name: e.name}))
    FOREACH (e IN data.propD |
      MERGE (:LabelD {name: e}))
    FOREACH (e IN data.propG |
      MERGE (:LabelG {name: e}))
    FOREACH (e IN data.propH |
      MERGE (:LabelE {name: e}))
    FOREACH (e IN data.propE + data.propF |
      MERGE (:LabelA {id: e}))
    
  2. Use a query that will process each relationship just once, allowing you to use CREATE (which does not need to scan) instead of MERGE.

    NOTE: This second step requires that no two $list parameters (as used in separate query invocations) contain data that would cause the creation of the same relationship, and the same constraint holds within a single $list parameter as well. Generating such $list parameters is largely left up to you, but a rough sketch follows.
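    For example, if no single input line produces the same relationship twice, it suffices to make sure each line is processed only once across all files. A rough sketch (keying lines by id is an assumption about your data):

    import json

    def unique_batches(paths, batch_size=200):
        # Yield parameter batches in which every line appears at most once
        # across the entire import, so each derived relationship is CREATEd
        # exactly once. Keeping all seen ids in memory is a simplification.
        seen = set()
        batch = []
        for path in paths:
            with open(path) as f:
                for line in f:
                    data = json.loads(line)
                    if data['id'] in seen:
                        continue
                    seen.add(data['id'])
                    batch.append(data)
                    if len(batch) == batch_size:
                        yield batch
                        batch = []
        if batch:
            yield batch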

    Once you have appropriate $list parameters, you can do this:

    UNWIND $list as data
    MATCH (p:LabelA {id: data.id})
    
    MATCH (j:LabelF {name: data.name})
    CREATE (p) -[:RELATION_A {prop1: data.prop1, prop2: data.prop2}]-> (j) 
    
    WITH p, data
    
    MATCH (v:LabelC {name: data.propA})
    CREATE (p) -[:RELATION_B]-> (v) 
    
    WITH p, data
    
    UNWIND data.propB as elem
    MATCH (a:LabelB {id: elem.id})
    CREATE (a) -[:RELATION_C]-> (p)   
    
    WITH DISTINCT p, data
    
    UNWIND data.propC as elem
    MATCH (u:LabelF {name: elem.name})
    CREATE (u) -[:RELATION_C]-> (p)
    
    WITH DISTINCT p, data
    
    MATCH (e:LabelD) WHERE e.name IN data.propD
    CREATE (p) -[:RELATION_D]-> (e) 
    
    WITH DISTINCT p, data
    
    MATCH (out:LabelA) WHERE out.id IN data.propE
    CREATE (p) -[:RELATION_E]-> (out)
    
    WITH DISTINCT p, data
    
    MATCH (inc:LabelA) WHERE inc.id IN data.propF
    CREATE (p) <-[:RELATION_E]- (inc)
    
    WITH DISTINCT p, data
    
    MATCH (pdf:LabelG) WHERE pdf.name IN data.propG
    CREATE (p) -[:RELATION_F]-> (pdf)
    
    WITH DISTINCT p, data
    
    MATCH (s:LabelE) WHERE s.name IN data.propH
    CREATE (p) -[:RELATION_G]-> (s)
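
    Since every MATCH above hits an indexed property and every relationship is CREATEd exactly once, the per-batch cost should stay roughly flat as the graph grows. One caveat: an UNWIND or MATCH over an empty list yields no rows, which silently drops that data element from the remainder of the query, so if your lists can be empty, consider splitting each relationship type into its own query.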