Here are some hints on how to import your data using the Cypher LOAD CSV clause. To handle truly large import tasks, you may want to look at the neo4j-import tool (neo4j-admin import in newer releases) instead.
Handling varying numbers of columns is not a problem, since you can treat each CSV file row as a collection of items.
You should import your data in two passes through the CSV file. In the first pass, create all the Person nodes. In the second pass, match the appropriate nodes and then create relationships between them. To greatly speed up the second pass, you should first create either an index or a uniqueness constraint (which will create an index for you) for matching Person nodes by ID.
I will assume that the first column of each row contains a person's unique ID, and that the remaining columns (however many there are) contain the IDs of the people that person follows.
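For example, a hypothetical varying.csv (the file name used in the queries below) might look like this, where the person with ID 103 follows nobody and so has a single-column row:

101,102,103,104
102,101
103
104,101,103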
First, create a uniqueness constraint:
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;
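As an aside, this ON ... ASSERT form is the legacy (Neo4j 3.x) syntax; it was deprecated in 4.x and removed in 5. On Neo4j 4.4 or later the equivalent would be the following, where person_id is just an illustrative constraint name:

CREATE CONSTRAINT person_id IF NOT EXISTS FOR (p:Person) REQUIRE p.id IS UNIQUE;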
Then, create the Person nodes using the IDs in the first column of your CSV file. We use MERGE to ensure that LOAD CSV does not abort (due to the uniqueness constraint) if there happen to be any duplicate IDs in column 1. If you are sure that there are no duplicate IDs, you can use CREATE instead, which should be faster. To avoid running out of memory, we process and commit 10000 rows at a time:
USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///varying.csv" AS row
MERGE (:Person {id: row[0]});
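Similarly, USING PERIODIC COMMIT was deprecated in Neo4j 4.4 and removed in 5. On recent versions the same batching is done with CALL { ... } IN TRANSACTIONS; here is a minimal sketch of the first pass in that form (run it with the :auto prefix in Neo4j Browser or cypher-shell, since it must execute in an implicit transaction):

LOAD CSV FROM "file:///varying.csv" AS row
CALL {
  WITH row
  MERGE (:Person {id: row[0]})
} IN TRANSACTIONS OF 10000 ROWS;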
Finally, create the relationships between the appropriate Person nodes. This query uses USING INDEX hints to encourage Cypher to take advantage of the index (automatically created by the uniqueness constraint) to quickly find the matching Person nodes. Again, to avoid running out of memory, we process 10000 rows at a time:
USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///varying.csv" AS row
// row[0] is the follower's ID; the rest of the row holds the IDs of the people being followed
WITH row[0] AS pid1, row[1..] AS followed
UNWIND followed AS pid2
MATCH (p1:Person {id: pid1}), (p2:Person {id: pid2})
USING INDEX p1:Person(id)
USING INDEX p2:Person(id)
MERGE (p1)-[:FOLLOWS]->(p2);
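Once both passes finish, you can sanity-check the import with a couple of simple counts (purely illustrative, not part of the import itself):

MATCH (p:Person) RETURN count(p);
MATCH (:Person)-[r:FOLLOWS]->(:Person) RETURN count(r);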