Importing large dataset into neo4j (with a twist) - slow

Question

I'm working with approximately 17 million prescription claims, each containing the following fields (subset):

claim_id (one record per claim)
patient_id
drug_id
provider_id

My nodes correspond to the fields above, and the relationships are:

patient - [:FILLED] -> prescription
provider - [:WROTE] -> prescription
prescription - [:CONTAINS] -> drug

The input file is not ordered, i.e. a patient, provider or drug can appear anywhere in the file.

I'm using py2neo with Cypher MERGE, processing the file in batches of 1,000 rows, to ensure that no duplicate patients, providers or drugs are created.

Problem: Performance. Each batch (4 nodes + 4 relationships X 1,000) takes about a minute, and that time increases as the graph grows.

Question: Is there a better way of doing this? Open to non-python suggestions.

Michael Hunger · Accepted Answer · 2014-01-27T23:52:50

If you want to do CSV + Cypher, you can look at the shell import tools:

https://github.com/jexp/neo4j-shell-tools#cypher-import

There, CSV columns are mapped to parameters for your Cypher statement.

Make sure to create unique constraints / indexes upfront (for 2.0), so that MERGE can leverage them during the insert.
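As a rough sketch of what that could look like for the data in the question: the labels (Patient, Provider, Drug, Prescription) and property names below are assumptions derived from the question's field names, not something the asker specified. The statements use Neo4j 2.0 syntax; in that version a unique constraint is backed by an index, which MERGE can use for its lookup instead of scanning existing nodes. How the statements are sent to the server (py2neo, the shell, etc.) is deliberately left out.

```python
# Sketch only: labels and property names are assumptions based on the question's fields.

# Neo4j 2.0 unique constraints, to be run once before the import starts.
CONSTRAINTS = [
    "CREATE CONSTRAINT ON (p:Patient) ASSERT p.patient_id IS UNIQUE",
    "CREATE CONSTRAINT ON (p:Provider) ASSERT p.provider_id IS UNIQUE",
    "CREATE CONSTRAINT ON (d:Drug) ASSERT d.drug_id IS UNIQUE",
    "CREATE CONSTRAINT ON (rx:Prescription) ASSERT rx.claim_id IS UNIQUE",
]

# One parameterized statement per claim row; {name} is the 2.0 parameter syntax.
MERGE_CLAIM = """
MERGE (pa:Patient {patient_id: {patient_id}})
MERGE (pr:Provider {provider_id: {provider_id}})
MERGE (d:Drug {drug_id: {drug_id}})
MERGE (rx:Prescription {claim_id: {claim_id}})
MERGE (pa)-[:FILLED]->(rx)
MERGE (pr)-[:WROTE]->(rx)
MERGE (rx)-[:CONTAINS]->(d)
"""

FIELDS = ("claim_id", "patient_id", "drug_id", "provider_id")


def claim_params(row):
    """Pick the four id fields out of one CSV row (a dict) as statement parameters."""
    return {field: row[field] for field in FIELDS}
```

Each MERGE matches on a single constrained property, so it is that property's index that gets used for the lookup.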

If you want a dynamic relationship type, you can use #{type} in your statement (which is resolved by the import tool, not by Cypher).

Check out the CSV batch-importer, which should be able to import your data in a few minutes.

See: https://github.com/jexp/batch-import/tree/20#neo4j-csv-batch-importer

Just create one or more CSV files for nodes and for relationships.
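A minimal sketch of that preprocessing step, assuming the claims file is comma-separated with a header row naming the four fields from the question. The function name `split_claims` and the output column names (`key`, `label`, `start`, `end`, `type`) are placeholders for illustration only; the header format the batch-importer actually expects is described in its README and should be used instead. Because the input is unordered, a set tracks which nodes have already been written so that each patient, provider and drug appears only once.

```python
import csv


def split_claims(claims_path, nodes_path, rels_path):
    """Split an unordered claims CSV into a deduplicated node file and a relationship file.

    Returns the number of distinct nodes written.
    """
    seen = set()  # keys of nodes already written; held in memory for the whole run
    with open(claims_path, newline="") as src, \
         open(nodes_path, "w", newline="") as nodes_out, \
         open(rels_path, "w", newline="") as rels_out:
        nodes = csv.writer(nodes_out, delimiter="\t")
        rels = csv.writer(rels_out, delimiter="\t")
        # Placeholder headers: replace with the format from the importer's README.
        nodes.writerow(["key", "label"])
        rels.writerow(["start", "end", "type"])

        def node(label, value):
            """Write the node on first sight and return its key."""
            key = label + ":" + value
            if key not in seen:
                seen.add(key)
                nodes.writerow([key, label])
            return key

        for row in csv.DictReader(src):
            rx = node("Prescription", row["claim_id"])
            patient = node("Patient", row["patient_id"])
            provider = node("Provider", row["provider_id"])
            drug = node("Drug", row["drug_id"])
            rels.writerow([patient, rx, "FILLED"])
            rels.writerow([provider, rx, "WROTE"])
            rels.writerow([rx, drug, "CONTAINS"])
    return len(seen)
```

With roughly 17 million claims the set of seen keys gets large, so this needs a machine with enough memory; sorting the input and deduplicating in a streaming pass would be an alternative.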
