Why Neo4j CSV loader doesn't increment the load with large number of records

Question

I have a CSV file with the following columns:

Child_Object_ID;
Child_Object_Name;
Child_Object_Type;
Parent_Object_ID;
Parent_Object_Name;
Parent_Object_Type

As the names suggest, the node described by (Child_Object_ID, Child_Object_Name, Child_Object_Type) is a child of the node described by (Parent_Object_ID, Parent_Object_Name, Parent_Object_Type). A parent node can itself be a child of some other parent node.

This CSV file contains 1.1 million records. The problem I am facing is that after about 100K records the load stops making visible progress: the loading process keeps running, but no further nodes or relationships are created.

I am using the following Cypher query to load the data into Neo4j (Windows edition):

CREATE INDEX ON :Object(Object_ID)
CREATE INDEX ON :Object(Object_ID, Object_Name, Object_Type)
CREATE INDEX ON :Object(Object_Type)

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
MERGE  (object1:Object {Object_ID:csvLine.CHILD_OBJECT_ID, Object_Name:csvLine.CHILD_OBJECT_NAME, Object_Type:csvLine.CHILD_OBJECT_TYPE})
MERGE  (object2:Object {Object_ID:csvLine.PARENT_OBJECT_ID, Object_Name:csvLine.PARENT_OBJECT_NAME, Object_Type:csvLine.PARENT_OBJECT_TYPE})
MERGE (object1)-[:Child_Of]->(object2)

Is using the command line CSV import tool an option? If you can start with an empty database, the command line tool is a lot faster and should load 1.1M records well under a minute. — Gabor Szarnyas
Hi Gabor Szarnyas, yes, I am loading this into an empty database. I downloaded the Windows version of Neo4j. Does the Windows version also have the command line tool? — Hari
Yes, it is available in the Windows version, in the bin directory. — Gabor Szarnyas
To use the import tool, you'll need at least two CSVs: one for nodes and another for relationships, both following a specific header format. As I stated in my first comment, this loader is a lot faster than LOAD CSV, but it requires a bit of tinkering. So it is only worth doing if load time really is an issue (i.e. a load time of ~1 hour is unacceptable for your use case). — Gabor Szarnyas

InverseFalcon · Accepted Answer · 2017-11-30T01:09:15

A problem with your load query is that the plan contains an Eager operation, which prevents PERIODIC COMMIT from batching (you should see a warning for this query in the query input box; check the warning message).

Without batching, your import is likely running into memory issues.

To avoid the eager operation, try an import query that only MERGEs all nodes with a single variable. After that's done, run a load that uses MATCH for both child and parent (which will match to existing nodes) then MERGE the relationship.
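
As a rough sketch of that approach, the import could be split into three separate statements, each run on its own. This reuses the file name, header names, label and relationship type from the question; it keeps the question's three-property MERGE so that node identity stays the same, which is an assumption about what you want rather than a recommendation. It also assumes the index on :Object(Object_ID) from the question already exists, so the lookups are index-backed.

```cypher
// Pass 1: create the child nodes, using a single node variable.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
MERGE (o:Object {Object_ID: csvLine.CHILD_OBJECT_ID,
                 Object_Name: csvLine.CHILD_OBJECT_NAME,
                 Object_Type: csvLine.CHILD_OBJECT_TYPE});

// Pass 2: create the parent nodes, again with a single node variable.
// Parents that already exist as children from pass 1 are matched, not duplicated.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
MERGE (o:Object {Object_ID: csvLine.PARENT_OBJECT_ID,
                 Object_Name: csvLine.PARENT_OBJECT_NAME,
                 Object_Type: csvLine.PARENT_OBJECT_TYPE});

// Pass 3: all nodes exist now, so MATCH both ends and MERGE only the relationship.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
MATCH (child:Object {Object_ID: csvLine.CHILD_OBJECT_ID,
                     Object_Name: csvLine.CHILD_OBJECT_NAME,
                     Object_Type: csvLine.CHILD_OBJECT_TYPE})
MATCH (parent:Object {Object_ID: csvLine.PARENT_OBJECT_ID,
                      Object_Name: csvLine.PARENT_OBJECT_NAME,
                      Object_Type: csvLine.PARENT_OBJECT_TYPE})
MERGE (child)-[:Child_Of]->(parent);
```

If Object_ID alone uniquely identifies a node in your data, merging and matching on Object_ID only (and setting the other properties with ON CREATE SET) would likely be simpler. Before running each statement it is worth checking that the Eager warning no longer appears for it.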

Here's an article (older, but still applicable) on avoiding eager operations.
