I'm working with approximately 17 million prescription claims, each containing the following fields (subset):
claim_id (one record per claim)
patient_id
drug_id
provider_id
My nodes are the same as the fields above, and the relationships are:
patient - [:FILLED] -> prescription
provider - [:WROTE] -> prescription
prescription - [:CONTAINS] -> drug
The input file is not ordered, i.e. a given patient, provider or drug can appear anywhere in the file.
I'm using py2neo and Cypher MERGE, processing the file in batches of 1,000 rows, to ensure that no duplicate patients, providers or drugs are created.
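To make the setup concrete, here is a minimal sketch of this kind of load, not the exact code. The labels (Patient, Provider, Drug, Prescription), the `id` property, and the helper names `batches` and `statements` are all assumptions for illustration, and no database call is shown.

```python
from itertools import islice

# Hypothetical labels and property names; the real schema may differ.
# Uses the $param placeholder style; older Cypher versions use {param}.
MERGE_CLAIM = """
MERGE (pat:Patient {id: $patient_id})
MERGE (prov:Provider {id: $provider_id})
MERGE (drug:Drug {id: $drug_id})
MERGE (rx:Prescription {id: $claim_id})
MERGE (pat)-[:FILLED]->(rx)
MERGE (prov)-[:WROTE]->(rx)
MERGE (rx)-[:CONTAINS]->(drug)
"""


def batches(rows, size=1000):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def statements(chunk):
    """Pair the MERGE statement with one parameter dict per row."""
    return [(MERGE_CLAIM, dict(row)) for row in chunk]
```

Each (statement, parameters) pair would then be sent to the database, one transaction per batch. The exact py2neo call depends on the library version, so it is left out here.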
Problem: Performance - it's taking about a minute per batch ((4 nodes + 4 relationships) x 1,000), and that time is increasing as the graph grows.
Question: Is there a better way of doing this? I'm open to non-Python suggestions.