I have a dataset of roughly 100 GB split across 50 files that I want to import into an empty Neo4j 3.5.5 Enterprise instance with a 16 GB heap and an 8 GB page cache, running on a 32 GB RAM AWS server. The data is in JSON Lines format and is fed via Python into this query:
WITH {list} AS list
UNWIND list AS data
MERGE (p:LabelA {id: data.id})
SET p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
    p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8
MERGE (j:LabelF {name: data.prop9})
MERGE (p) -[:RELATION_A {prop1: data.prop10, prop2: data.prop11}]-> (j)
MERGE (v:LabelC {name: data.prop12})
MERGE (p) -[:RELATION_B]-> (v)
FOREACH (elem IN data.prop13 |
  MERGE (a:LabelB {id: elem.id})
    ON CREATE SET a.name = elem.name
  MERGE (a) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop14 |
  MERGE (u:LabelF {name: elem.name})
  MERGE (u) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop15 |
  MERGE (e:LabelD {name: elem})
  MERGE (p) -[:RELATION_D]-> (e)
)
FOREACH (elem IN data.prop16 |
  MERGE (out:LabelA {id: elem})
  MERGE (p) -[:RELATION_E]-> (out)
)
FOREACH (elem IN data.prop17 |
  MERGE (inc:LabelA {id: elem})
  MERGE (p) <-[:RELATION_E]- (inc)
)
FOREACH (elem IN data.prop18 |
  MERGE (pdf:LabelG {name: elem})
  MERGE (p) -[:RELATION_F]-> (pdf)
)
FOREACH (elem IN data.prop19 |
  MERGE (s:LabelE {name: elem})
  MERGE (p) -[:RELATION_G]-> (s)
)
The list parameter contains 200 JSON lines, and each query runs in its own transaction.
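The Python side is essentially a loop that parses 200 JSON lines at a time and passes them as the list parameter. Roughly like this (a simplified sketch, assuming py2neo as suggested by the self.graph.run(...) calls below; BATCH_SIZE and import_file are just illustrative names):
import json

from py2neo import Graph  # assumed client library, based on the self.graph.run(...) calls below

BATCH_SIZE = 200  # 200 JSON lines per query, as described above

IMPORT_QUERY = """WITH {list} AS list
UNWIND list AS data
..."""  # the full query shown above

def import_file(graph, path):
    """Stream one JSON Lines file and run the import query once per batch of 200 lines."""
    batch = []
    with open(path) as handle:
        for line in handle:
            batch.append(json.loads(line))
            if len(batch) == BATCH_SIZE:
                graph.run(IMPORT_QUERY, list=batch)  # each run() is its own auto-commit transaction
                batch = []
    if batch:
        graph.run(IMPORT_QUERY, list=batch)  # flush the final, smaller batch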
Indexes and constraints are created prior to the data import:
self.graph.run('CREATE INDEX ON :LabelB(name)')
self.graph.run('CREATE CONSTRAINT ON (p:LabelA) ASSERT (p.id) IS NODE KEY;')
self.graph.run('CREATE CONSTRAINT ON (p:LabelB) ASSERT (p.id) IS NODE KEY;')
for label in ['LabelC', 'LabelD', 'LabelE', 'LabelF', 'LabelG']:
    self.graph.run(f'CREATE CONSTRAINT ON (p:{label}) ASSERT (p.name) IS NODE KEY;')
The first couple of checkpoints still complete reasonably quickly:
2019-05-23 15:49:10.141+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 45 checkpoint completed in 134ms
2019-05-23 16:04:45.515+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 1603 checkpoint completed in 35s 345ms
2019-05-23 16:22:38.299+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 3253 checkpoint completed in 2m 52s 483ms
But at some point the checkpoints settle at durations of about 20-25 minutes each (this excerpt is from an earlier attempt):
2019-05-23 07:40:03.755 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18240 checkpoint started...
2019-05-23 07:42:15.942 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.144] version=144, last transaction in previous log=18253
2019-05-23 07:45:51.982 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=224, gcTime=240, gcCount=1}
2019-05-23 07:46:42.059 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 18279 to [/data/databases/graph.db/neostore.counts.db.a], from [/data/databases/graph.db/neostore.counts.db.b].
2019-05-23 07:53:49.108 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=158, gcTime=157, gcCount=1}
2019-05-23 08:03:11.556 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18240 checkpoint completed in 23m 7s 800ms
2019-05-23 08:03:11.710 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 140-141, last checkpoint was made in version 143
2019-05-23 08:04:38.454 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.145] version=145, last transaction in previous log=18377
2019-05-23 08:05:57.288 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=248, gcTime=253, gcCount=1}
2019-05-23 08:11:08.331 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=143, gcTime=224, gcCount=1}
2019-05-23 08:16:37.491 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=228, gcTime=237, gcCount=1}
2019-05-23 08:18:11.732 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18471 checkpoint started...
2019-05-23 08:23:18.767 INFO [o.n.k.NeoStoreDataSource] Rotated to transaction log [/var/lib/neo4j/data/databases/graph.db/neostore.transaction.db.146] version=146, last transaction in previous log=18496
2019-05-23 08:24:55.141 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 18505 to [/data/databases/graph.db/neostore.counts.db.b], from [/data/databases/graph.db/neostore.counts.db.a].
2019-05-23 08:38:21.660 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=136, gcTime=195, gcCount=1}
2019-05-23 08:40:46.261 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 18471 checkpoint completed in 22m 34s 529ms
2019-05-23 08:40:46.281 INFO [o.n.k.i.t.l.p.LogPruningImpl] Pruned log versions 142-143, last checkpoint was made in version 145
Can anyone tell me what is happening here? I tried tinkering with the transaction retention and log size properties (both increasing and decreasing them) to no avail, and I also ran the import on a 64 GB AWS server with a 24 GB heap and a 24 GB page cache. In every scenario the time it takes to complete a checkpoint keeps increasing. While the first two files took less than two hours each to import, I aborted the import of the third file after 6 hours because it would import data for 15 minutes (the default interval between checkpoints) and then sit in a checkpoint for another 25 minutes.
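For reference, these are the neo4j.conf settings involved (Neo4j 3.5 names; the memory values are the ones from the 32 GB machine above, the log and checkpoint lines show the 3.5 defaults, purely for orientation):
# Memory, as described above (neo4j.conf)
dbms.memory.heap.initial_size=16g
dbms.memory.heap.max_size=16g
dbms.memory.pagecache.size=8g

# Transaction log retention / rotation -- the properties I experimented with
dbms.tx_log.rotation.retention_policy=7 days
dbms.tx_log.rotation.size=250M

# Checkpoints are triggered by this time threshold (15 minutes by default)
dbms.checkpoint.interval.time=15m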
Update 2019-05-24 17:35 UTC+2
I attempted the first part of cybersam's solution by using CREATE for relationship creation where applicable (i.e. where the relationship is represented only once in the 100 GB of input) and dropped the creation of RELATION_E, which I will perform at a later stage with preprocessed input files. This leads to the following import query:
WITH {list} AS list
UNWIND list AS data
MERGE (p:LabelA {id: data.id})
SET p.prop1 = data.prop1, p.prop2 = data.prop2, p.prop3 = data.prop3, p.prop4 = data.prop4,
    p.prop5 = data.prop5, p.prop6 = data.prop6, p.prop7 = data.prop7, p.prop8 = data.prop8
MERGE (j:LabelF {name: data.prop9})
CREATE (p) -[:RELATION_A {prop1: data.prop10, prop2: data.prop11}]-> (j)
MERGE (v:LabelC {name: data.prop12})
CREATE (p) -[:RELATION_B]-> (v)
FOREACH (elem IN data.prop13 |
  MERGE (a:LabelB {id: elem.id})
    ON CREATE SET a.name = elem.name
  CREATE (a) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop14 |
  MERGE (u:LabelF {name: elem.name})
  CREATE (u) -[:RELATION_C]-> (p)
)
FOREACH (elem IN data.prop15 |
  MERGE (e:LabelD {name: elem})
  CREATE (p) -[:RELATION_D]-> (e)
)
FOREACH (elem IN data.prop18 |
  MERGE (pdf:LabelG {name: elem})
  CREATE (p) -[:RELATION_F]-> (pdf)
)
FOREACH (elem IN data.prop19 |
  MERGE (s:LabelE {name: elem})
  CREATE (p) -[:RELATION_G]-> (s)
)
Then I stopped Neo4j, deleted the graph.db directory, changed the config to checkpoint every 15 seconds so I could quickly tell whether the checkpointing time still increases, and restarted the data import. Unfortunately, the time still increases:
2019-05-24 15:25:40.718+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 59 checkpoint completed in 240ms
2019-05-24 15:26:02.003+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 86 checkpoint completed in 1s 283ms
2019-05-24 15:26:27.518+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 105 checkpoint completed in 5s 514ms
2019-05-24 15:26:55.079+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 141 checkpoint completed in 7s 559ms
2019-05-24 15:27:23.944+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 179 checkpoint completed in 8s 864ms
2019-05-24 15:27:59.768+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 218 checkpoint completed in 15s 823ms
2019-05-24 15:28:42.819+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 269 checkpoint completed in 23s 9ms
2019-05-24 15:29:33.318+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 328 checkpoint completed in 30s 498ms
2019-05-24 15:30:32.847+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 397 checkpoint completed in 39s 489ms
2019-05-24 15:31:41.918+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 480 checkpoint completed in 49s 30ms
2019-05-24 15:33:03.113+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 576 checkpoint completed in 1m 1s 194ms
Is there an index missing somewhere?
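To check, the existing indexes and constraints can be listed directly; db.indexes() also reports whether an index is ONLINE or still POPULATING (a quick sketch against the same self.graph handle):
# List all indexes (including their state) and all constraints
for record in self.graph.run('CALL db.indexes()'):
    print(record)
for record in self.graph.run('CALL db.constraints()'):
    print(record)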
Update 2019-05-28 18:44 UTC+2
I created a parameter set of 100 lines and imported it into an empty Neo4j using PROFILE. The query plan looks like this:
Update 2019-07-14 16:11 UTC+2
I circumvented the problem by preprocessing my JSONL files into CSV files without duplicates and using neo4j-admin import.
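For completeness, the offline import was invoked roughly like this (a sketch only: the file names are placeholders, and the node labels and relationship types come from :LABEL and :TYPE header columns in the CSV files):
bin/neo4j-admin import --database=graph.db --id-type=STRING \
    --nodes=import/labela_nodes.csv \
    --nodes=import/labelf_nodes.csv \
    --relationships=import/relations.csv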