
I'm having issues importing a large volume of data into a Neo4j instance using the Cypher LOAD CSV command. I'm attempting to load roughly 253k user records, each with a unique user_id. My first step was to add a unique constraint on the label to make sure each user was only created once:

CREATE CONSTRAINT ON (b:User) ASSERT b.user_id IS UNIQUE;
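With this constraint in place, a second CREATE for the same user_id should fail immediately; for example (a sketch with a hypothetical id, not part of the actual import):

CREATE (:User {user_id: 42});
// This second CREATE violates the unique constraint and is rejected
// with an "already exists with label ... and property ..." error.
CREATE (:User {user_id: 42});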

I then tried to run LOAD CSV with periodic commits to pull this data in.

That query failed, so I tried to MERGE the User record first and then SET its properties:

USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_users.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id)})-[:REGISTERED_TO]->(t)
set p.created=toInt(line.created), p.completed=toInt(line.completed);

Modifying the periodic commit value made no difference; the same error is returned by this variant as well:

USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);

I receive the following error:

LoadCsvStatusWrapCypherException: Node 9752 already exists with label Person and property "hpcm_uk_buddy_id"=[2446] (Failure when processing URL 'file:/home/data/uk_buddies.csv' on line 253316 (which is the last row in the file). Possibly the last row committed during import is line 253299. Note that this information might not be accurate.)

The numbers roughly match up: the CSV file contains 253315 records in total. The periodic commit doesn't seem to have taken effect either; a count of nodes returns only 5446 rows.

neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 5446     |
+----------+
1 row
768 ms
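Counting only the User label (a sketch, to exclude the Territory node and anything else) should give a clearer picture of how many users were actually created:

// Count only User nodes, excluding the Territory node.
match (n:User) return count(n);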

I could understand the node count being wrong if this ID occurs only roughly 5000 rows into the CSV file. But is there any technique or command I can use to make this import succeed?


1 Answer


I think you're falling victim to a common mistake with MERGE. Seriously, this would be in my top 10 FAQs about common problems with Cypher. You're doing this:

USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})-[:REGISTERED_TO]->(t);

The way MERGE works, that last MERGE matches on the entire pattern, not just on the user node. So you're probably creating duplicate users that you shouldn't be. When you run this MERGE, even if a user with those exact properties already exists, the relationship to the t node doesn't, so Cypher attempts to create a new user node with those attributes to connect to t, which isn't what you want.
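To see this concretely, here's a minimal sketch (with a hypothetical user_id, not your data) of what happens when the user node exists but the relationship doesn't:

create (:Territory {territory:"uk"});
create (:User {user_id:1});

// The whole pattern (User)-[:REGISTERED_TO]->(Territory) doesn't match yet,
// so merge tries to create the entire path, including a second
// (:User {user_id:1}) node, which your unique constraint then rejects.
match (t:Territory {territory:"uk"})
merge (p:User {user_id:1})-[:REGISTERED_TO]->(t);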

The solution is to merge the user individually, then separately merge the relationship path, like this:

USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
merge (p:User {user_id:toInt(line.user_id), created:toInt(line.created), completed:toInt(line.completed)})
merge (p)-[:REGISTERED_TO]->(t);

Note the two MERGEs at the end. The first creates just the user. If the user already exists, it won't try to create a duplicate, and you should be OK with your constraint (assuming there aren't two users with the same user_id but different created values). After you've merged just the user, you then merge the relationship.
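If your CSV could ever repeat a user_id with different created or completed values, a variant worth considering (a sketch, assuming user_id alone identifies a user) is to merge on just the constrained key and move the other properties into ON CREATE SET:

USING PERIODIC COMMIT 1000
load csv with headers from "file:///home/data/uk_buddies.csv" as line
match (t:Territory{territory:"uk"})
// merge only on the constrained key so repeated rows can't collide
merge (p:User {user_id:toInt(line.user_id)})
  on create set p.created=toInt(line.created), p.completed=toInt(line.completed)
merge (p)-[:REGISTERED_TO]->(t);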

The net result of the second query is the same, but it shouldn't create duplicate users.
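Afterwards, a quick duplicate check (a sketch) should confirm that no user_id appears on more than one node:

// Any row returned here indicates duplicate User nodes.
match (p:User)
with p.user_id as id, count(*) as c
where c > 1
return id, c;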