1
votes

Problem: How to load ~8 GB of data, >10 million rows, of the following format into Neo4j efficiently. I am using the DocGraph data set which shows relationships between Medicare providers. The dataset is a csv with columns:

From_ID, To_ID, Count_Patients, Count_Transacts, Avg_Wait, Stdv_Wait

From_ID means ID of a doctor making a referral. To_ID is the doctor who receives the referral. The last four columns are relationship properties. Any ID in the first or 2nd column can reappear in either column, because providers can have many relationships in either direction.

Here is the basic query I've come up with (very new to Cypher but adept at SQL):

LOAD CSV FROM "url"
CREATE (n:provider {NPI : line[0]}) 
WITH line, n 
MERGE (m:provider {NPI : line[1]}) 
WITH m,n, line 
MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)

It seems to work with a small subset of data but last time I tried it on the full dataset it broke my neo4j and it kept timing out when I tried to restart it, so I had to terminate my EC2 instance and start from scratch.

Appreciate any advice I can get and help with the Cypher query. Also, I am planning to merge this data with additional Medicare data with more properties about the nodes e.g. doctor specialty, location, name, etc, so let me know how I should take that into consideration.

Instance details: Ubuntu 18.04, m5ad.large (2 vCPUS, 8GB RAM, 75GB SSD)

2

2 Answers

1
votes

It seems very likely that your logic is flawed.

You should investigate whether multiple rows in your CSV file can have the same line[0] value. If so, your CREATE clause should be change to a MERGE, to avoid the creation of a potentially large number of duplicate provider nodes (and therefore also duplicate :REFERRED relationships).

0
votes

Did you try using

USING PERIODIC COMMIT 1000 ......

USING PERIODIC COMMIT 1000 LOAD CSV FROM "url" CREATE (n:provider {NPI : line[0]}) WITH line, n MERGE (m:provider {NPI : line[1]}) WITH m,n, line MERGE (n)-[r:REFERRED {patients: line[2], transacts: line[3], avgdays: line[4], stdvdays: line[5]}]->(m)