Problem: How to load ~8 GB of data, >10 million rows, of the following format into Neo4j efficiently. I am using the DocGraph data set which shows relationships between Medicare providers. The dataset is a csv with columns:
From_ID, To_ID, Count_Patients, Count_Transacts, Avg_Wait, Stdv_Wait
From_ID means ID of a doctor making a referral. To_ID is the doctor who receives the referral. The last four columns are relationship properties. Any ID in the first or 2nd column can reappear in either column, because providers can have many relationships in either direction.
Here is the basic query I've come up with (very new to Cypher but adept at SQL):
LOAD CSV FROM "url"
CREATE (n:provider {NPI : line[0]})
WITH line, n
MERGE (m:provider {NPI : line[1]})
WITH m,n, line
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)
It seems to work with a small subset of data but last time I tried it on the full dataset it broke my neo4j and it kept timing out when I tried to restart it, so I had to terminate my EC2 instance and start from scratch.
Appreciate any advice I can get and help with the Cypher query. Also, I am planning to merge this data with additional Medicare data with more properties about the nodes e.g. doctor specialty, location, name, etc, so let me know how I should take that into consideration.
Instance details: Ubuntu 18.04, m5ad.large (2 vCPUS, 8GB RAM, 75GB SSD)