2
votes

I'm working with approximately 17 million prescription claims, each containing the following fields (subset):

claim_id (one record per claim)
patient_id
drug_id
provider_id

My nodes are the same as the fields above, and the relationships are:

patient -[:FILLED]-> prescription
provider -[:WROTE]-> prescription
prescription -[:CONTAINS]-> drug

The input file is not ordered, i.e. a patient, provider or drug can appear anywhere in the file.

I'm using py2neo with Cypher MERGE, processing the file in batches of 1,000 rows, to ensure that no duplicate patients, providers or drugs are created.

Problem: Performance. Each batch (4 nodes + 4 relationships x 1,000) takes about a minute, and that time increases as the graph grows.

Question: Is there a better way of doing this? Open to non-python suggestions.

3 Answers

2
votes

If you want to do CSV + Cypher, you can look at the shell-import tools:

https://github.com/jexp/neo4j-shell-tools#cypher-import

There, the CSV columns are mapped to parameters for your Cypher statement.

Make sure to create unique constraints / indexes upfront (for 2.0), so that MERGE can use them during the insert.
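As a rough sketch of what "upfront" could look like, the snippet below just builds the Neo4j 2.x constraint statements, one per node type. The labels (`Patient`, `Provider`, `Drug`, `Prescription`) and the choice of key property are assumptions based on the question, not something the import tool prescribes. Run the printed statements in the shell before starting the import.

```python
# Hypothetical labels and key properties taken from the question's fields;
# adjust them to your own model.
UNIQUE_KEYS = {
    "Patient": "patient_id",
    "Provider": "provider_id",
    "Drug": "drug_id",
    "Prescription": "claim_id",
}


def constraint_statements(unique_keys):
    """Build one Neo4j 2.x uniqueness-constraint statement per label."""
    return [
        "CREATE CONSTRAINT ON (n:%s) ASSERT n.%s IS UNIQUE" % (label, key)
        for label, key in sorted(unique_keys.items())
    ]


for statement in constraint_statements(UNIQUE_KEYS):
    print(statement)
```

A uniqueness constraint also gives you an index on that property, so MERGE can look up an existing node instead of scanning all nodes with that label.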

If you want a dynamic relationship type, you can use #{type} in your statement (this is resolved by the import tool, not by Cypher).

Check out the CSV batch-importer, which should be able to import your data in a few minutes.

See: https://github.com/jexp/batch-import/tree/20#neo4j-csv-batch-importer

Just create one or more CSV files for nodes and for relationships.
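Since the input file is unordered, the node files need de-duplicating before the import. A minimal sketch of that split in plain Python is below. The column names come from the question; the function name and in-memory sample are made up for illustration, and the exact header/column layout the batch-importer expects is described in its README, so this only shows the de-duplication step.

```python
import csv
import io


def split_claims(rows):
    """Split claim rows into de-duplicated node ids and relationship triples."""
    nodes = {"patient": set(), "provider": set(), "drug": set(), "claim": set()}
    rels = []
    for row in rows:
        # Sets ensure each patient / provider / drug id is kept only once,
        # no matter where it appears in the unordered input.
        nodes["patient"].add(row["patient_id"])
        nodes["provider"].add(row["provider_id"])
        nodes["drug"].add(row["drug_id"])
        nodes["claim"].add(row["claim_id"])
        rels.append((row["patient_id"], "FILLED", row["claim_id"]))
        rels.append((row["provider_id"], "WROTE", row["claim_id"]))
        rels.append((row["claim_id"], "CONTAINS", row["drug_id"]))
    return nodes, rels


# Tiny in-memory sample standing in for the real claims file.
sample = io.StringIO(
    "claim_id,patient_id,drug_id,provider_id\n"
    "c1,p1,d1,r1\n"
    "c2,p1,d2,r1\n"
)
nodes, rels = split_claims(csv.DictReader(sample))
print(sorted(nodes["patient"]), len(rels))
```

From here you would write each set in `nodes` to its own node file and `rels` to a relationship file. Note that ids are only unique per type in this sketch, so prefix them if a patient_id can collide with a drug_id.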

1
votes

You could also have a look at using Geoff through the load2neo extension. This supports uniqueness through its exclamation mark syntax so might be able to help you out.

The overall syntax looks very similar to Cypher with a few minor differences and py2neo has direct support for load2neo with the load_geoff method.

1
votes

Here's what I do for data sets with similar size in Python/py2neo:

Split the creation of unique nodes and relationships. Make sure to use a WriteBatch to speed up the process.

  1. Create all patient, provider and drug nodes, and store the py2neo nodes in a Python dict keyed by patient_id, provider_id or drug_id. Use the dict to make sure each id is created only once.

  2. Go over your data again and create the claim nodes and their relationships to the unique patient, provider and drug nodes. py2neo lets you create a claim node and the relationships for that node in the same batch.

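    # py2neo 1.x imports
    from py2neo import rel
    from py2neo.neo4j import WriteBatch

    # write batch (graph_db is your existing graph database connection)
    batch = WriteBatch(graph_db)

    for line in your_data:
        # your fields
        claim_id = ...
        patient_id = ...

        # look up the unique patient node created in step 1
        patient_node = my_dict_from_step_one[patient_id]

        # create the claim node and its relationship in the same batch
        claim_node = batch.create({'claim_id': claim_id})
        batch.create(rel(patient_node, "FILLED", claim_node))

    results = batch.submit()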
    

17 million operations in a single batch will make it explode. Submit the batch every 1,000 operations or so instead.
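One way to do the periodic submit is to cut the input into chunks first and use one batch per chunk. Here is a minimal sketch of the chunking part only. `chunked` is a hypothetical helper, not a py2neo function, and the py2neo calls are left as a comment.

```python
def chunked(items, size=1000):
    """Yield successive lists of at most `size` items from any iterable."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Do not forget the last, partially filled chunk.
        yield chunk


for chunk in chunked(range(2500), size=1000):
    # In the real loader: open a new WriteBatch here, add the create()
    # calls for the rows in this chunk, then submit it before moving on.
    print(len(chunk))
```

Because it is a generator, only one chunk of rows is held in memory at a time, which matters with 17 million lines.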