
I know how to import a CSV file into the neo4j graph database, but all the examples I have found use a fixed number of columns, like this:

id1,id2,id3,id4,id5
id2,id2,id3,id4,id5
id3,id2,id3,id4,id5

But I have a CSV file with a variable number of columns that describes relationships between people. It looks like this:

id1,id2,id3,id4,id5
id2,id2,id3,id4,id5,id6,id7
id3,id2,id3

This means that person id1 follows id2, id3, id4, and id5, and person id2 follows id2, id3, id4, id5, id6, and id7.

The file is huge (about 6 GB). How should I import it into neo4j?


1 Answer


Here are some hints on how to import using the Cypher LOAD CSV clause. To handle truly large data import tasks, you may want to look at the neo4j-import tool instead.
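
If you do go the neo4j-import route, note that it expects fixed-format input, so you would first need to preprocess your variable-column file into separate node and relationship files (where the node file has an id:ID header column and the relationship file has :START_ID,:END_ID columns). A minimal sketch of the invocation, with persons.csv and follows.csv as hypothetical names for those preprocessed files:

# persons.csv: one row per person; follows.csv: one row per follower/followed pair
neo4j-import --into /path/to/graph.db \
    --nodes:Person persons.csv \
    --relationships:FOLLOWS follows.csv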

Handling a varying number of columns is not a problem, since you can treat each CSV row as a collection of items.
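
For example, you can inspect how a few rows are parsed before running the real import (the LIMIT is purely for this sanity check):

LOAD CSV FROM "file:///varying.csv" AS row
RETURN row[0] AS person, row[1..] AS followed
LIMIT 5;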

You should import your data in 2 passes through the CSV file. In the first pass, create all the Person nodes. In the second pass, match the appropriate nodes and then create relationships between them. To greatly speed up the second pass, you should first create either an index or a uniqueness constraint (which will create an index for you) for matching Person nodes by ID.

I will assume that:

  • There is one row in your CSV file per Person, with the first column of each row having that person's unique ID.
  • The row for a Person will have only one column if that person does not follow anyone.
  • Your neo4j model looks something like this:

    (p1:Person {id: 123})-[:FOLLOWS]->(p2:Person {id: 234})

First, create a uniqueness constraint:

CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

Then, create the Person nodes using the IDs in the first column of your CSV file. We use MERGE to ensure that LOAD CSV does not abort (due to the uniqueness constraint) if there happen to be any duplicate IDs in column 1. If you are sure that there are no duplicate IDs, you can use CREATE instead, which should be faster. To avoid running out of memory, we process and commit 10000 rows at a time:

USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///varying.csv" AS row
MERGE (:Person {id: row[0]});
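
Once that pass finishes, an optional sanity check: the node count should equal the number of distinct IDs in column 1 of your file.

MATCH (p:Person)
RETURN count(p);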

Finally, create the relationships between the appropriate Person nodes. This query uses USING INDEX hints to encourage Cypher to take advantage of the index (automatically created by the uniqueness constraint) to quickly find the appropriate Person nodes. Again, to avoid running out of memory, we process 10000 rows at a time:

USING PERIODIC COMMIT 10000
LOAD CSV FROM "file:///varying.csv" AS row
WITH row[0] AS pid1, row[1..] AS followed
UNWIND followed AS pid2
MATCH (p1:Person {id: pid1}), (p2:Person {id: pid2})
USING INDEX p1:Person(id)
USING INDEX p2:Person(id)
MERGE (p1)-[:FOLLOWS]->(p2);
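
As a final optional check, you can count the FOLLOWS relationships that were created:

MATCH (:Person)-[r:FOLLOWS]->()
RETURN count(r);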