1
votes

I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:

begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit

Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.

I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?

many thanks

2

2 Answers

1
votes

Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.

Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.

1
votes

The load2neo plugin worked well for me. Installation was fast+painless and it has a very cypher-like command structure that easily supports uniqueness requirements. Works with neo4j 2.0 labels.

load2neo install + curl usage example: http://nigelsmall.com/load2neo

load2neo Geoff syntax: http://nigelsmall.com/geoff

It is much faster (>>10x) than using Cypher via neo4j-shell.

I wasn't able to get the parameters in Cypher through neo4j-shell working despite trying everything I could find via internet search.