Our py2neo script ingests abstracts into Neo4J at a rate of about 500,000 a day. For comparison, we ingest 20 million of these abstracts into Solr in a single day. Is this the expected ingestion rate for Neo4J, or is there something we can do to increase performance?
We've tried combinations of py2neo version 2 and version 3 with Neo4J Enterprise version 2 and 3. With each combination, the ingestion rate remains about the same. We use batches of 1000 abstracts to increase performance. The abstracts average about 400-500 words; for each abstract we create 5 additional entities with modest properties, then create a relationship between the abstract and each entity. We first ingest the entities and then the relationships (via create_unique()) to avoid round trips to the server (no find() or find_one()). We prefer merge() over create() to ensure only one node is created per abstract; we did try create(), and load performance improved only slightly. The bottleneck appears to be on the server side: our script creates the 1000 transactions quickly, but there is then an extended delay during the commit, suggesting the slowdown is in the Neo4J server while it processes the transaction.
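Roughly, each batch goes through two phases, something like the sketch below (py2neo v3 API; the labels, property names, and the MENTIONS relationship type are illustrative placeholders, not our exact schema):

    from py2neo import Graph, Node, Relationship

    graph = Graph("http://localhost:7474/db/data/")  # HTTP transport

    def ingest_batch(abstracts):
        # Phase 1: merge the abstract node plus its 5 entity nodes,
        # all in a single transaction per batch of 1000 abstracts.
        tx = graph.begin()
        rows = []
        for a in abstracts:
            abstract = Node("Abstract", abstract_id=a["id"], text=a["text"])
            tx.merge(abstract)  # merge() so each abstract yields exactly one node
            entities = []
            for name in a["entities"]:
                entity = Node("Entity", name=name)
                tx.merge(entity)
                entities.append(entity)
            rows.append((abstract, entities))
        tx.commit()  # <-- the extended delay happens here

        # Phase 2: relationships via create_unique(), no find()/find_one()
        # round trips, again one transaction for the whole batch.
        tx = graph.begin()
        for abstract, entities in rows:
            for entity in entities:
                tx.create_unique(Relationship(abstract, "MENTIONS", entity))
        tx.commit()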
We require a solution that does not wipe the entire Neo4J database. We intend to ingest multiple data streams in parallel in the future, so the DB must remain stable.
We prefer Python over Java, and we prefer py2neo's merge()/create()-based transactions over direct Cypher queries.
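For context, the hand-written Cypher we would rather not maintain would look roughly like this (the UNWIND batching pattern; parameter, label, and relationship names are hypothetical, and this is not our actual code):

    from py2neo import Graph

    graph = Graph("http://localhost:7474/db/data/")

    # 'batch' is a list of dicts like
    # {"id": ..., "text": ..., "entities": [...]}.
    query = """
    UNWIND {batch} AS row
    MERGE (a:Abstract {abstract_id: row.id})
    SET a.text = row.text
    WITH a, row
    UNWIND row.entities AS name
    MERGE (e:Entity {name: name})
    MERGE (a)-[:MENTIONS]->(e)
    """

    def ingest_batch_cypher(rows):
        graph.run(query, batch=rows)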
We were hoping Bolt would give us better performance, but currently a Bolt transaction hangs indefinitely with py2neo v3 against Neo4J 3.0.0 RC1. We have also seen one instance of an HTTP transaction hanging.
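For reference, we switch transports like this (our understanding of the py2neo v3 connection settings; ports are the defaults):

    from py2neo import Graph

    # HTTP transport: works, but commits are slow.
    graph_http = Graph("http://localhost:7474/db/data/")

    # Bolt transport: this is the configuration whose transactions
    # hang indefinitely for us against Neo4J 3.0.0 RC1.
    graph_bolt = Graph(bolt=True, host="localhost")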
Our Neo4J instances use the default configuration. Our server is a 2-processor, 12-core Linux host with 32GB of memory.
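As far as we can tell, the relevant knobs we have left at their defaults are the heap and page cache sizes. In Neo4J 3.x these live in the server config (file and key names as we understand them; the heap keys sit in neo4j-wrapper.conf on some 3.0 builds, and the values below are only an example of what tuning might look like on a 32GB host, not settings we have applied):

    # conf/neo4j.conf
    dbms.memory.heap.initial_size=8g
    dbms.memory.heap.max_size=8g
    dbms.memory.pagecache.size=16g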
Any suggestions on how to increase load performance? It would be grand if we could ingest 20 million abstracts into Neo4J in just a few days.
Our ingestion script shows a rate of 54 entity transactions per second. Note that's 54, not 54K:
$ python3 neo-ingestion-rate.py
Number of batches: 8
Entity transactions per batch: 6144
Merge entities: 2016-04-22 16:31:50.599126
All entities committed: 2016-04-22 16:47:08.480335
Entity transactions per second: 53.5494121750082
Relationship transactions per batch: 5120
Merge unique relationships: 2016-04-22 16:47:08.480408
All relationships committed: 2016-04-22 16:49:38.102694
Number of transactions: 40960
Relationship transactions per second: 273.75593641599323
Thanks.
Our merge() pattern per node is simply:

    transaction = graph.begin()
    node = Node()
    transaction.merge(node)
    transaction.commit()

Our schema is 6 nodes per abstract, with relationships from each abstract to the other 5 nodes. We'll try to set up a demo of our slow ingestion rate and post it here to help with the analysis. - Saoirse
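In the meantime, here is a rough skeleton of the demo's timing harness (simplified; the batch and node counts mirror the run above, but the label and property names are placeholders):

    import datetime
    from py2neo import Graph, Node

    graph = Graph("http://localhost:7474/db/data/")

    def merge_batch(nodes):
        # One transaction per batch, exactly as in our real script.
        tx = graph.begin()
        for node in nodes:
            tx.merge(node)
        tx.commit()

    # 8 batches of 6144 entities, matching the output above.
    batches = [[Node("Entity", uid=i * 6144 + j) for j in range(6144)]
               for i in range(8)]

    start = datetime.datetime.now()
    print("Merge entities:", start)
    for batch in batches:
        merge_batch(batch)
    end = datetime.datetime.now()
    print("All entities committed:", end)

    total = sum(len(batch) for batch in batches)
    print("Entity transactions per second:", total / (end - start).total_seconds())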