4 votes

I need to insert a huge number of nodes with relationships between them into Neo4j via the REST API's batch endpoint, at approx. 5k records/s (still increasing).

This will be continuous insertion, 24x7. Each record may require creating one node only, but others may require two nodes and one relationship to be created.

Can I improve the performance of the inserts by changing my procedure or modifying the settings of Neo4j?

My progress so far:

1. I have been testing with Neo4j for a while, but I could not get the performance I needed.

Test server box: 24 cores + 32GB RAM

Neo4j 2.0.0-M06 installed as a standalone service.

Running my Java application on the same server. (Neo4j and the Java app will need to run on their own servers in the future, so embedded mode cannot be used.)

REST API Endpoint : /db/data/batch (target: /cypher)

Using schema indexes, constraints, MERGE and CREATE UNIQUE.

2. My schema:

neo4j-sh (0)$ schema
==> Indexes
==>   ON :REPLY(created_at)   ONLINE
==>   ON :REPLY(ids)          ONLINE (for uniqueness constraint)
==>   ON :REPOST(created_at)  ONLINE
==>   ON :REPOST(ids)         ONLINE (for uniqueness constraint)
==>   ON :Post(userId)        ONLINE
==>   ON :Post(postId)        ONLINE (for uniqueness constraint)
==> 
==> Constraints
==>   ON (post:Post) ASSERT post.postId IS UNIQUE
==>   ON (repost:REPOST) ASSERT repost.ids IS UNIQUE
==>   ON (reply:REPLY) ASSERT reply.ids IS UNIQUE

3. My Cypher queries and JSON requests

3.1. When a record requires only a single node to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (child:Post {postId:1001, userId:901})"}}

3.2. When a record requires two nodes and one relationship to be created, the job description looks like this:

{"method" : "POST","to" : "/cypher","body" : {"query" : "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[relationship:REPOST {ids:'1002_1003', created_at:'Wed Nov 06 14:06:56 AST 2013' }]->child"}}

3.3. I normally send 100 job descriptions per batch (a mix of 3.1 and 3.2), which takes about 150-250 ms to complete.
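For illustration, here is a simplified sketch of how my application assembles and POSTs such a batch from Java (the class and helper names are made up for this post, and the JSON escaping is kept minimal):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class BatchSender {

    private static final String BATCH_URL = "http://localhost:7474/db/data/batch";

    // Wraps each Cypher query in a job description like the ones in 3.1/3.2.
    static String buildBatchBody(List<String> queries) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < queries.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"method\":\"POST\",\"to\":\"/cypher\",\"body\":{\"query\":\"")
              .append(queries.get(i).replace("\"", "\\\""))
              .append("\"}}");
        }
        return sb.append(']').toString();
    }

    static void sendBatch(List<String> queries) throws Exception {
        byte[] body = buildBatchBody(queries).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(BATCH_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        System.out.println("Batch returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        sendBatch(Arrays.asList(
            "MERGE (child:Post {postId:1001, userId:901})",
            "MERGE (parent:Post {postId:1002, userId:902}) MERGE (child:Post {postId:1003, userId:903}) CREATE UNIQUE parent-[r:REPOST {ids:'1002_1003'}]->child"));
    }
}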

4. Performance problems

4.1. Concurrency:

/db/data/batch (target: /cypher) does not appear to be thread-safe; testing with two or more concurrent threads brought the Neo4j server down within seconds to minutes.

4.2. MERGE with constraints does not always work.

When creating two nodes and one relationship with a single query (as in 3.2 above), it sometimes works like a charm, but it sometimes fails with a CypherExecutionException saying that Node xxxx already exists with label aaaa and property "bbbbb"=[ccccc]. From my understanding, MERGE is not supposed to throw an exception; it should return the node if it already exists.

As a result of the exception, the whole batch fails and is rolled back, which affects my insert rate.

I have opened a GitHub issue for this: https://github.com/neo4j/neo4j/issues/1428

4.3. CREATE UNIQUE with constraints doesn't always work for relationship creation.

This is mentioned in the same GitHub issue too.

4.4. Performance:

Before I used batch with Cypher, I tried legacy indexing with get_or_create (/db/data/index/node/Post?uniqueness=get_or_create and /db/data/index/relationship/XXXXX?uniqueness=get_or_create).

Because of the nature of those legacy index endpoints (they return the location of the data in the index instead of the location of the data in the actual data storage), I could not use them within a batch (I needed to refer to nodes created earlier in the same batch).

I know I could enable auto_indexing and deal with the data storage directly instead of the legacy index, but since 2.0.0 schema indexes are recommended over legacy indexes, so I decided to switch to the batch + Cypher + schema index approach.

HOWEVER, with batch + Cypher I can only get an insert rate of about 200 job descriptions per second. It would have been much higher if MERGE with constraints always worked, say about 600-800/s, but that is still much lower than 5k/s. I also tried schema indexes without any constraints, but that ended up with an even lower insert rate.

Have you been using streaming JSON? docs.neo4j.org/chunked/2.0.0-M06/rest-api-streaming.html. The next thing that pops out at me is regarding the legacy index endpoints. You can still refer to the creation in another batch operation. Example: create a node using uniqueness get_or_create; then the next item in the batch is Cypher that uses START n=node:IndexName(Key={value}) ..... So you can't reference it by batch id, but you can still take advantage of the get_or_create portion. – LameCoder
Thank you LameCoder! Yes, I looked at streaming and thought it was more for retrieving from the server, not sending to the server, so I didn't try it, but I'll try it now. In addition, thanks for the tips on the legacy index. :) – GoSkyLine
Thanks @plasmid87 for correcting my spelling, grammar, and formatting. – GoSkyLine
@GoSkyLine you're most welcome, I hope you receive a useful answer! – plasmid87
@LameCoder With streaming enabled and batch size 1000, I was able to get an insert rate of 2000+/s. Wow, that's absolutely an improvement! However, I'm seeing some behaviour changes too: 1. I'm no longer receiving that exception (did they all succeed, or is it just not returning the exception anymore?); 2. I'm not seeing all the data in the DB: I had about 2.3M records for my test, but only 101,318 nodes were created; 3. I was counting the total number of nodes in the DB while my app was running, and the number wasn't always increasing: 8 -> ... -> 63082 -> 62873 -> 62503 -> 62669 -> 62516 -> 62462 -> ... -> 101318. Any ideas? – GoSkyLine

1 Answer

5 votes

With 2.0 I would use the transactional endpoint to create your statements in batches, e.g. 100 or 1000 per HTTP request and about 30k-50k per transaction (until you commit).

See this for the format of the new streaming, transactional endpoint:

http://docs.neo4j.org/chunked/milestone/rest-api-transactional.html
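Roughly, a client-side sketch of that flow in Java (assuming the default localhost URL; error handling and response parsing omitted):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TxEndpointSketch {

    // POSTs a JSON payload and returns the Location header, if any.
    static String post(String url, String json) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        conn.getResponseCode(); // ensure the request is sent
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        return location;
    }

    public static void main(String[] args) throws Exception {
        // A batch of statements; in practice put 100-1000 per request.
        String batch = "{\"statements\":[{"
                + "\"statement\":\"MERGE (p:Post {postId:{postId}})\","
                + "\"parameters\":{\"postId\":1001}}]}";

        // 1. Open a transaction with the first batch; the Location header
        //    points at the open transaction, e.g. /db/data/transaction/1.
        String txUrl = post("http://localhost:7474/db/data/transaction", batch);

        // 2. Keep POSTing further batches to the same URL until you have
        //    accumulated ~30k-50k statements in the transaction...
        post(txUrl, batch);

        // 3. ...then commit it.
        post(txUrl + "/commit", "{\"statements\":[]}");
    }
}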

Also, for such a high-performance, continuous-insertion endpoint, I heartily recommend writing a server extension, which would run against the embedded API and can easily insert 10k or more nodes and relationships per second; see here for the documentation:

http://docs.neo4j.org/chunked/milestone/server-unmanaged-extensions.html
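A minimal skeleton of such an extension might look like this (a sketch only; the path, payload handling, and property values are placeholders to adapt):

import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

@Path("/insert")
public class InsertResource {

    private final GraphDatabaseService database;

    public InsertResource(@Context GraphDatabaseService database) {
        this.database = database;
    }

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    public Response insert(String payload) {
        // Parse your records out of the payload, then write them with the
        // embedded API; one transaction can cover many records.
        try (Transaction tx = database.beginTx()) {
            Node post = database.createNode(DynamicLabel.label("Post"));
            post.setProperty("postId", 1001); // illustrative values only
            post.setProperty("userId", 901);
            tx.success();
        }
        return Response.ok().build();
    }
}

You register the class in conf/neo4j-server.properties via org.neo4j.server.thirdparty_jaxrs_classes and POST your records straight to it.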

For pure inserts you don't need Cypher. And for concurrency, just take a lock on a well-known node (per subgraph that you are inserting) so that concurrent inserts are no issue; you can do that with tx.acquireWriteLock() or by removing a non-existent property from a node (REMOVE n.__lock__).
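For example, reusing the imports from the skeleton above (lockNodeId is a placeholder for however you locate the lock node of the subgraph being inserted):

private void insertWithLock(GraphDatabaseService database, long lockNodeId) {
    try (Transaction tx = database.beginTx()) {
        Node lockNode = database.getNodeById(lockNodeId);
        tx.acquireWriteLock(lockNode); // concurrent writers on this subgraph block here
        // ... create this record's nodes and relationships ...
        tx.success();
    } // the lock is released when the transaction closes
}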

For another example of writing an unmanaged extension (but one that uses Cypher), check out this project. It even has a mode that might help you (POSTing CSV files to the server endpoint to be executed using a Cypher statement per row).

https://github.com/jexp/cypher-rs