1
votes

I am working on creating a graph database in neo4j for a CALL dataset. The dataset is stored in csv file with following columns: Source, Target, Timestamp, Duration. Here Source and Target are Person id's (numeric), Timestamp is datetime and duration is in seconds (integer).

I modeled my graph where person are nodes(person_id as property) and call as relationship (time and duration as property). There are around 2,00,000 nodes and around 70 million relationships. I have a separate csv files with person id's which I used to create the nodes. I also added uniqueness constraint on the Person id's.

CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE

I didn't completely understand the working of bulk import so I wrote a python script to split my csv into 70 csv's where each csv has 1 million nodes (saved as calls_0, calls_1, .... calls_69). I took the initiative to manually run a cypher query changing the filename every time. It worked well(fast enough) for first few(around 10) files but then I noticed that after adding relationship from a file, the import is getting slower for the next file. Now it is taking almost 25 minutes for importing a file. Can someone link me to an efficient and easy way of doing it?

Here is the cypher query:

:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)

I am using Neo4j 4.0.3

2
Have you indexed the node properties used in the merge? CREATE INDEX [index_name] FOR (n:LabelName) ON (n.propertyName)David A Stumpf
I have added uniqueness constraint on the node property. According to the documentation, "Adding the unique constraint will implicitly add an index on that property"kinger
Understood. An index speeds LOAD CSV very significantly. But (I don't know the answer) does a constraint impose an extra burden on LOAD CSV. Preventing a duplicate (constraint) is different than finding whether it exists. It may take an experiment: remove the constraint and add only an index and then compare the results.David A Stumpf

2 Answers

0
votes

Your MERGE clause has to check for an existing matching relationship (to avoid creating duplicates). If you added a lot of relationships between Person nodes, that could make the MERGE clause slower.

You should consider whether it is safe for you to use CREATE instead of MERGE.

0
votes

Is much better if you export the match using the ID of each node and then create the relationship.

POC

CREATE INDEX ON :Person(`pid`);

CALL apoc.export.csv.query("LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
RETURN ID(a) AS ida,ID(b) as idb,time,Duration","rels.csv", {});

and then

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:////rels.csv' AS row
MATCH (a:Person) WHERE ID(a) = toInt(row.ida)
MATCH (b:Person) WHERE ID(b) = toInt(row.idb)
MERGE (b)-[:CALLS {time: row.time, duration: Duration}]->(a);

For me this is the best way to do this.