4 votes

Our py2neo script ingests abstracts at a rate of about 500,000 a day with Neo4J. For comparison, we ingest 20 million of these abstracts into Solr in one day. Is this the expected ingestion rate for Neo4J, or is there something we can do to increase performance?

We've tried combinations of py2neo version 2 and version 3 with Neo4J Enterprise version 2 and 3. With each combination, the ingestion rate remains about the same. We use batches of 1000 abstracts to increase performance. The abstracts average about 400-500 words; for each abstract we create 5 additional entities with modest properties, then create a relationship between the abstract and each entity. We first ingest the entities and then the relationships (via create_unique()) to avoid round trips to the server (no find() or find_one()). We prefer merge() over create() to ensure only one node is created per abstract; we did try create(), and load performance only improved slightly. The bottleneck appears to be on the server side: our script creates the 1000 transactions quickly, then there is an extended delay during the commit, suggesting the slowdown is in the Neo4J server while it processes the transaction. A simplified sketch of the pattern is shown below.
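Here is a simplified sketch of the ingestion pattern (the labels, property names, and connection URI are placeholders, not our actual values):

from py2neo import Graph, Node, Relationship

graph = Graph("http://localhost:7474/db/data/")  # connection URI is a placeholder
BATCH_SIZE = 1000

def ingest_batch(batch):
    # One transaction per batch of 1000 abstracts: merge the abstract node,
    # the ~5 entity nodes, and the relationships, then commit once.
    tx = graph.begin()
    for record in batch:
        abstract = Node("Abstract", id=record["id"], text=record["text"])
        tx.merge(abstract)
        for name in record["entities"]:
            entity = Node("Entity", name=name)
            tx.merge(entity)
            tx.merge(Relationship(abstract, "MENTIONS", entity))
    tx.commit()  # the extended delay we observe happens here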

We require a solution that does not wipe the entire Neo4J database. We intend to ingest multiple data streams in parallel in the future, so the DB must remain stable.

We prefer Python over Java and prefer py2neo's merge()/create() based transactions over direct Cypher queries.

We were hoping Bolt would give us better performance, but currently a Bolt transaction hangs indefinitely with py2neo v3 / Neo4J 3.0.0 RC1. We also had one instance of an HTTP transaction hanging.

Our Neo4J instances use the default configuration.

Our server is a 2 processor, 12 core, Linux host with 32GB of memory.

Any suggestions on how to increase load performance? It would be grand if we could ingest 20 million abstracts into Neo4J in just a few days.

Our ingestion script shows a transaction rate of 54 entity transactions per second. Note that's 54, not 54K:

$ python3 neo-ingestion-rate.py
Number of batches: 8
Entity transactions per batch: 6144
Merge entities: 2016-04-22 16:31:50.599126
All entities committed: 2016-04-22 16:47:08.480335
Entity transactions per second: 53.5494121750082
Relationship transactions per batch: 5120
Merge unique relationships: 2016-04-22 16:47:08.480408
All relationships committed: 2016-04-22 16:49:38.102694
Number of transactions: 40960
Relationship transactions per second: 273.75593641599323

Thanks.

This is hard to tell without knowing what your schema and your scripts look like, especially the usage of indices. In general with Cypher I get a write throughput of 6k nodes and 27k relationships per second (both with 3 properties each) on my laptop, so there is a lot of room to investigate your queries and scripts first, before the server, imo - Christophe Willemsen
We use py2neo transaction=graph.begin() / node=Node() / transaction.merge(node) / transaction.commit(). Our schema is 6 nodes per abstract with relationships from each abstract to the other 5 nodes. We'll try to set up a demo of our slow ingestion rate and post it here to help with the analysis. - Saoirse
Don't tell me you do this 6 times for 6 nodes, right? - Christophe Willemsen
We updated the original question with the script we used. Can you spot a flaw that would cause underwhelming performance? - Saoirse

2 Answers

1 vote

How about loading via neo4j-shell? I do the majority of my work in R and simply script the import.

Here is a blog post where I outline the approach. You could mirror it in Python.

The basic idea is to take your data, save it to disk, and load it via neo4j-shell, where you execute Cypher scripts that reference those files.

I have found this approach to be helpful when loading larger sets of data. But of course, it all depends on the density of your data, the data model itself, and having the appropriate indexes established.
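Roughly, a Python version could look something like the following (the import directory, labels, and property names are placeholders; in Neo4j 3.x, LOAD CSV resolves relative file URLs against the configured import directory):

import csv
import os
import subprocess

IMPORT_DIR = "/var/lib/neo4j/import"   # placeholder; use your instance's import directory

abstracts = [
    {"id": "a1", "text": "First abstract ..."},
    {"id": "a2", "text": "Second abstract ..."},
]

# 1. Save the data to disk as CSV.
with open(os.path.join(IMPORT_DIR, "abstracts.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(abstracts)

# 2. Write a Cypher script that bulk-loads the CSV with periodic commits.
with open("/tmp/load_abstracts.cql", "w") as f:
    f.write(
        "USING PERIODIC COMMIT 1000\n"
        "LOAD CSV WITH HEADERS FROM 'file:///abstracts.csv' AS row\n"
        "MERGE (a:Abstract {id: row.id})\n"
        "SET a.text = row.text;\n"
    )

# 3. Execute the script via neo4j-shell (path and options may differ per install).
subprocess.run(["neo4j-shell", "-file", "/tmp/load_abstracts.cql"], check=True)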

0 votes

This blog post explains how to import data in bulk:

https://neo4j.com/blog/bulk-data-import-neo4j-3-0/

They claim to be able to import ~31M nodes and ~78M relationships in about 3 minutes.

They just don't mention the machine this is running on, most likely a cluster.

Still, it shows it should be possible to achieve a much higher ingestion rate than what you observe.

The Python class likely imports one record at a time, when you really want to do bulk inserts.
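For what it's worth, a rough sketch of feeding that bulk importer from Python could look like this (file names, labels, and installation paths are placeholders; note the tool builds a brand-new store rather than adding to an existing database):

import csv
import subprocess

# Nodes file: the ":ID" and ":LABEL" header tokens are part of the importer's CSV format.
with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id:ID", "text", ":LABEL"])
    writer.writerow(["a1", "First abstract ...", "Abstract"])
    writer.writerow(["e1", "", "Entity"])

# Relationships file: ":START_ID", ":END_ID", ":TYPE" columns.
with open("rels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    writer.writerow(["a1", "e1", "MENTIONS"])

# Run the Neo4j 3.0 bulk importer (paths are placeholders; adjust to your install).
subprocess.run([
    "/opt/neo4j/bin/neo4j-import",
    "--into", "/opt/neo4j/data/databases/import.db",
    "--nodes", "nodes.csv",
    "--relationships", "rels.csv",
], check=True)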