I'm loading a Neo4j database by piping Cypher commands directly into neo4j-shell. My experiments suggest that subgraph batches of about 1000 lines give the best throughput (about 3.2 ms/line, or roughly 300 lines/sec, which is slow; Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the subgraph being loaded. Here's a trimmed example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships), and at that point the loading slows down even more. I presume the growing slowdown is due to the overhead of the MATCH statements, which make up half of the statements in each subgraph batch by the time the graph reaches 200K nodes. There has to be a better way of doing this; it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: "What is the most efficient way to insert nodes into a neo4j database using cypher" and http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practices for this that I'm missing. Any other speed-up ideas?
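For reference, here is a rough sketch of what the parameterized form of one article might look like, using the Neo4j 2.0 {param} placeholder syntax. The parameter names ({shorturl}, {author_id}, {props}) and the :Author label with its id property are placeholders invented for illustration, not the actual schema; the values would be supplied by the client alongside the statement.

```cypher
// Sketch only: {shorturl}, {author_id} and {props} are hypothetical parameter names.
// The statement text stays identical for every article, so only the values change.
MATCH (domain:SubDomain { shorturl: {shorturl} })
// :Author and its id property are assumed here; the original excerpt does not show them.
MATCH (author:Author { id: {author_id} })
// {props} is a map parameter holding id, url, type, timestamp and datetime.
CREATE (article:Article {props})
CREATE (article)-[:PUBLISHED_IN]->(domain),
       (author)-[:WROTE]->(article)
```

Because the query string no longer embeds literal values or per-node identifiers such as article1641203, the same statement can be reused for every row, which is the point of the parameter approach described in the linked references.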
Many thanks.