Neo4j Performance for large dataset

Question

I am trying to load large dataset into neo4j-3 and looking for the options. I found one neo4j-import but the problem with that is it is for initial load only. I have to load 2M records around every week. I tried loading through shell but having some performance issue, I tried following. 1) Creating constraint upfront. 2) Creating Node and relationships in separate query. 3) Heap space 8G 4) dbms.memory.pagecache 4G

Many times the import just hangs and does nothing for hours.

Edit - CSV load being executed:

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS
FROM "file:///my_sds_39_joe.csv"
AS row
OPTIONAL MATCH (per:Person {UID : "Person."+row.player_cardnum})
WHERE per IS NULL
MERGE (p:Person {CardNumber : row.player_cardnum})
ON CREATE SET p.Creation Date = timestamp(), p.Modification Date = timestamp() ;

@user2813165 Please be sure to add details by editing your question, not by burying in comments (especially code/queries, which are fairly unreadable as comments). I edited your question to include the csv-load from your comment, so you should delete the comment now. — David Makogon

InverseFalcon InverseFalcon · Accepted Answer · 2016-08-28T18:30:09

EDIT

On a second look, seems like you're trying to implement some kind of conditional logic to your insert.

It looks like what you're trying to do is figure out if a :Person exists with a UID (derived from some concatenation with row.player_cardnum), and in the case where that :Person doesn't exist and the match fails, MERGE a :Person with the CardNumber given by row.player_cardnum.

If this is your goal, you're ALMOST there with your query. The problem is with your WHERE clause.

Understand that WHERE clauses are linked with a preceding MATCH, OPTIONAL MATCH, or WITH, and only affects the linked clause.

With that WHERE on that OPTIONAL MATCH, per will always be null, but more importantly, your row will still exist, and the following MERGE will ALWAYS take place for all rows in the CSV. This is probably the source of your slowdown, as it's creating new :Person nodes for all rows.

If you're trying to null out the row completely when the OPTIONAL MATCH hits on an existing :Person (so the MERGE won't happen in that case), you'll need to add a WITH clause, and make sure your WHERE clause is applied to it instead of the OPTIONAL MATCH.

Additionally, make sure that you have either unique constraints or indexes on Person.UID and Person.CardNumber. As for the UID match, I've heard that indexes are not used when there's some kind of string concatenation of the thing you're matching upon, so you may need to assemble it first and pass it in with a WITH.

Your final query would look like this:

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS
FROM "file:///my_sds_39_joe.csv"
AS row
// first build the UID so we can take advantage of the index
WITH row, "Person." + row.player_cardnum AS UID
OPTIONAL MATCH (per:Person {UID : UID})
// the WHERE now applies to the WITH, which will filter out and null out the row when an OPTIONAL MATCH is found
WITH row, per
WHERE per IS NULL
MERGE (p:Person {CardNumber : row.player_cardnum})
ON CREATE SET p.Creation Date = timestamp(), p.Modification Date = timestamp() ;

Neo4j Performance for large dataset

1 Answers