Neo4j multi-client massive insertion - REST very poor performance - other ways?

Question

I'm trying to benchmark massive insertion into Neo4j in a client-server environment. So far I've found only two ways to do it:

use REST
implement server extension

I can say upfront that our design requires inserting from many concurrently running processes/machines, so using batch insert with a direct connection is not an option.

I would also like to avoid implementing a server extension, as we already have a tight schedule.

I benchmarked massive insertion via REST from just a single client, sending 2 kinds of very simple Cypher queries:

create (vertex:V {guid: {guid}, vtype: {vtype}, random1: {random1}, random2: {random2} })

match (a:V {guid: {a} }) match (b:V {guid: {b} }) create (a)-[:label]->(b)

The guid field had an index.

Results so far are very poor: around (10k V + 40k E) in 13 minutes, compared to competing products like Titan or Orient, which provide an efficient server out of the box and throughput of around (10k V + 40k E) per minute.

I tried longer-lasting transactions and query parameters; neither gives any significant gain. Furthermore, the overhead from REST itself is very small: I tested dummy queries and they execute much faster (and both client and server are on the same machine). I also tried inserting from multiple threads, but performance does not scale up.

I found another StackOverflow question where the advice was to batch inserts into large requests containing thousands of commands and commit periodically. Unfortunately, due to the nature of how we generate the data, batching the requests is not feasible. Ideally we'd like the inserts to be atomic operations and have the results appear as soon as they are executed (no need for transactions, in fact).

Thus my questions are:

are my Cypher queries optimal for the insertion?
are the results so far in line with what can be achieved with REST (or can I squeeze much more from REST)?
are there any other ways to perform efficient multi-client massive insertion?

Is {guid} the same as either {a} or {b}? Also, have you already created an index (or uniqueness constraint) on :V(guid)? — cybersam
Yes, there is an index on guid field. All 3 vars might have different values and those 2 queries will be executed separately. However when I'm creating an edge, I can guarantee that both vertices have already been created. — rohrl
how many concurrent clients and how many cores did your server have? do you have an ssd? did you observe anything unusual in disk or cpu monitoring? — Michael Hunger

Brian Underwood · Accepted Answer · 2016-02-04T09:02:34

I have a number of thoughts/questions that don't fit very well in a comment ;)

What version of Neo4j are you using? 2.3 introduced some things which might help
When you say you have an index, do you mean the new style and not the legacy indexes? The newer indexes are created with CREATE INDEX ON :V(guid) and apply to the combination of a label and a property. You can try your queries in the web console prefixed with PROFILE to see if the query is hitting the index and where it might be slow
If you can have the data in a CSV format you might look into the LOAD CSV clause in Cypher. That's also a batch sort of thing, so it might not be as useful
I don't think it would help performance much, but this is a bit nicer to read:

match (a:V {guid: {a} }), (b:V {guid: {b} }) create (a)-[:label]->(b)
I know it's of no help now, but Neo4j 3.0 is planned to have a new compressed binary socket protocol called Bolt which should be an improvement over REST. It's estimated for Q2
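For reference, here is a minimal Python sketch of how the parameterized queries above could be packed into one request body for the transactional Cypher HTTP endpoint (`/db/data/transaction/commit` in the 2.x line). It only builds the JSON and sends nothing. The `build_payload` helper, the sample guid values, and the host and port in the final comment are my own assumptions for illustration, not something from the question.

```python
import json

# Cypher from the question and answer; {name} is the Neo4j 2.x parameter syntax.
CREATE_VERTEX = (
    "create (vertex:V {guid: {guid}, vtype: {vtype}, "
    "random1: {random1}, random2: {random2} })"
)
CREATE_EDGE = "match (a:V {guid: {a} }), (b:V {guid: {b} }) create (a)-[:label]->(b)"


def build_payload(statements):
    """Hypothetical helper: turn (cypher, params) pairs into one request body."""
    return json.dumps({
        "statements": [
            {"statement": cypher, "parameters": params}
            for cypher, params in statements
        ]
    })


# Sample values are made up for illustration.
body = build_payload([
    (CREATE_VERTEX, {"guid": "g1", "vtype": "t", "random1": 1, "random2": 2}),
    (CREATE_VERTEX, {"guid": "g2", "vtype": "t", "random1": 3, "random2": 4}),
    (CREATE_EDGE, {"a": "g1", "b": "g2"}),
])
# POST `body` as application/json to
# http://localhost:7474/db/data/transaction/commit (assumed default host and port).
```

All statements in one such request run in a single transaction, so the per-request and per-commit cost is paid once rather than once per vertex or edge.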

I know a lot of these suggestions probably aren't too helpful, but they're things to think about. There's also a public Slack chat for Neo4j here:

http://neo4j.com/blog/public-neo4j-users-slack-group/

I'll share this question there to see if anybody has any ideas

EDIT:

Max DeMarzi passed on one of his articles on queueing requests which might be useful:

http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/

Looks like you'd need to write a bit of Java, but he lays it out for you
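To give a feel for the queueing idea without the Java, here is a rough client-side sketch in Python. It is not the server extension from the article, only the same pattern: producers submit single writes, and one background worker drains the queue and hands the writes over in batches. `WriteQueue`, `send_batch`, `max_batch` and `max_wait` are names I made up; in practice `send_batch` would be whatever posts a list of statements to the server.

```python
import queue
import threading


class WriteQueue:
    """Collects single writes from many producers and flushes them in batches."""

    def __init__(self, send_batch, max_batch=1000, max_wait=0.05):
        self._send_batch = send_batch  # hypothetical callable: list of writes -> None
        self._max_batch = max_batch    # upper bound on writes per flush
        self._max_wait = max_wait      # seconds to wait for a first write
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, cypher, params):
        """Enqueue one write; returns immediately. Do not call after close()."""
        self._queue.put((cypher, params))

    def close(self):
        """Flush everything still queued, then stop the worker."""
        self._stop.set()
        self._worker.join()

    def _run(self):
        while True:
            batch = []
            try:
                # Block briefly for the first write, then grab whatever else is ready.
                batch.append(self._queue.get(timeout=self._max_wait))
                while len(batch) < self._max_batch:
                    batch.append(self._queue.get_nowait())
            except queue.Empty:
                pass
            if batch:
                self._send_batch(batch)
            elif self._stop.is_set() and self._queue.empty():
                return


# Usage with a stand-in sender that just records each batch.
sent = []
writes = WriteQueue(sent.append, max_batch=2)
for i in range(5):
    writes.submit("create (vertex:V {guid: {guid}})", {"guid": i})
writes.close()
```

The trade-off is that a write only becomes visible once its batch is flushed, so `max_wait` and `max_batch` bound how far this drifts from "results appear as soon as they are executed".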
