1
votes

I have been using Neo4j for quite a while now. I ran this query earlier before my computer crashed 7 days ago and somehow unable to run it now. I need to create a graph database out of a csv of bank transactions. The original dataset has around 5 million rows and has around 60 columns.

This is the query I used, starting from 'Export CSV from real data' demo by Nicole White:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID})
SET transaction.base_currency_amount =toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region= line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE= line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);

I tried:

  1. Using LIMIT 0 before running query as per Micheal Hunger's suggestion in a post about 'Loading Large datasets'.

  2. Used single MERGE per statement (this is first merge and there are 4 other merges to be used) as suggested by Michael again in another post.

  3. Tried CALL apoc.periodic.iterate and apoc.cypher.parallel but doesn't work with LOAD CSV (seem to work only with MERGE and CREATE queries without LOAD CSV). I get following error with CALL apoc.periodic.iterate(""): Neo.ClientError.Statement.SyntaxError: Invalid input 'f': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or ')' (line 2, column 29 (offset: 57))

  4. Increased max heap size to 16G as my laptop is of 16GB RAM. Btw finding it difficult to write this post as I tried running again now with 'PROFILE ' and it is still running since an hour.

Help needed to load query of this 5 million rows dataset. Any help would highly be appreciated.Thanks in advance! I am using Neo4j 3.5.1 on PC.

1
Can you add the output of the :schema command along in your question, so we have a view on your index to evaluateChristophe Willemsen
@ChristopheWillemsen currently running the query, will output schema after I give up on it.Bharat Ram Ammu
3. currently it's a syntax error but you may get null pointer later if run MERGE in parallelRajendra Kadam
You are using MERGE then why are you using DISTINCT? I am not sure if it's causing any issues, still curious.Rajendra Kadam

1 Answers

1
votes
  1. MOST IMPORTANT: Create Index/Constraint on the key property.

CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;

  1. Don't set the max heap size to full of system RAM. Set it to 50%.

  2. Try ON CREATE SET instead of SET.

  3. You can also use apoc.periodic.iterate to load the data, but USING PERIODIC COMMIT is also fine.

  4. Importantly, if you are 'USING PERIODIC COMMIT' and the query is not finishing or running out of memory, it is likely because of using Distinct. Avoid Distinct as duplicate transactions will be handled by MERGE.

NOTE: (If you use apoc.periodic.iterate to MERGE nodes/relationships with parameter parallel=true then it fails with NULL POINTER EXCEPTION. use it carefully)

Questioner edit: Removing Distinct in 3rd line for Transaction node and re-running the query worked!