
I have several CSV files that range from 25-100 MB in size. I have created constraints, created indices, am using periodic commit, and increased the allocated memory in the neo4j-wrapper.conf and neo4j.properties.

neo4j.properties:

neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

neo4j-wrapper.conf changes:

wrapper.java.initmemory=5000
wrapper.java.maxmemory=5000

However my load is still taking a very long time, and I am considering using the recently released Import Tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.

I begin by creating several constraints to make sure that the IDs I'm using are unique:

CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
//and constraints for other name identifiers as well..
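For reference, the other constraints follow the same pattern, e.g. for the City label used below:

CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;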

I then use periodic commit...

USING PERIODIC COMMIT 10000

I then LOAD the CSV, skipping any rows where the fields I need are missing:

LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" as line
WITH line
WHERE line.CountryName IS NOT NULL AND line.CityName IS NOT NULL AND line.NeighborhoodName IS NOT NULL

I then create the necessary nodes from my data.

WITH line
MERGE(country:Country {name : line.CountryName})
MERGE(city:City {name : line.CityName})
MERGE(neighborhood:Neighborhood {
     name : line.NeighborhoodName,
     size : toInt(line.NeighborhoodSize),
     nickname : coalesce(line.NeighborhoodNN, ""),
     ... 50 other features
    })

MERGE (city)-[:IN]->(country)
CREATE (neighborhood)-[:IN]->(city)
//Note that each neighborhood only appears once

Does it make sense to use CREATE UNIQUE rather than MERGE for the Country references? Would this speed it up?

A ~250,000-line CSV file took over 12 hours to load, which seems excessively slow. What else could I be doing to speed this up? Or does it just make sense to use the annoying-looking Import Tool?


1 Answer


A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:

http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/

Basically, it says that you should add PROFILE to the start of each of your queries to see if any of them use the Eager operator. If they do, this can really cost you performance-wise, and you should probably split your query up into separate MERGEs.
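As a rough sketch of what that split could look like (reusing the file and labels from your question; the exact passes depend on your data), each label gets its own pass over the CSV, and PROFILE can be prefixed to a copy of the query to inspect the plan:

PROFILE
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" AS line
WITH line
WHERE line.CountryName IS NOT NULL
MERGE (country:Country {name: line.CountryName});

// Then run each piece as its own USING PERIODIC COMMIT pass:
// one for :Country, one for :City, one for :Neighborhood,
// and finally passes that MATCH the nodes and MERGE the relationships.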

Secondly, your neighborhood MERGE contains a lot of properties, so each time it has to match on every single one of those properties before deciding whether it should create the node or not. I'd suggest something like:

MERGE (neighborhood:Neighborhood {name: line.NeighborhoodName})
ON CREATE SET
     neighborhood.size = toInt(line.NeighborhoodSize),
     neighborhood.nickname = coalesce(line.NeighborhoodNN, ""),
     ... 50 other features
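Finally, since the MERGE now matches on the name property alone, it is worth making sure that property is backed by an index or unique constraint so the lookup does not have to scan every :Neighborhood node. Assuming neighborhood names really are unique (you note each neighborhood appears only once), that would be something like:

CREATE CONSTRAINT ON (n:Neighborhood) ASSERT n.name IS UNIQUE;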