1 vote

I am importing data consisting of around 12 million nodes and 13 million relationships.

First I used the CSV import with periodic commit 50000 and divided the data into different chunks, but it's still taking too much time.

Then I looked at the batch insertion method, but for that I would have to create new data sets in an Excel sheet.

Basically I am importing the data from SQL Server: first I save the data into CSV, then import it into my Neo4j.

Also, I am using the Neo4j community edition. I changed the configuration properties following everything I found on Stack Overflow, but still: with periodic commit 50K the import initially goes fast, but after 1 million rows it takes too much time.

Is there any way to import this data directly from SQL in a short span of time, since Neo4j is famous for working fast with big data? Any suggestions or help?

Here is the LOAD CSV used (with an index on :Number(num)):

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
AS csvLine FIELDTERMINATOR ';'
MERGE (Numbers:Number {num: csvLine.Numbers})
RETURN *;

USING PERIODIC COMMIT 50000
LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
AS csvLine FIELDTERMINATOR ';'
MERGE (TermNum:Number {num: csvLine.TermNum})
MERGE (OrigNum:Number {num: csvLine.OrigNum})
MERGE (OrigNum)-[r:CALLS]->(TermNum)
RETURN *;
It would help us if you could share your LOAD CSV command and your schema indexes and constraints. - Christophe Willemsen
create index on :Number(num); followed by the two LOAD CSV statements shown in the question. - Ch HaXam
Please modify your question to include the LOAD CSV. Can you also add your Java heap memory settings and the version of Neo4j you are using? - Christophe Willemsen
I am using Neo4j community version 2.2.0, and the Java heap memory setting is -Xmx512m; the index creation and LOAD CSV are as shown in the question. - Ch HaXam

2 Answers

3 votes

How long is it taking?

To give you a reference, my db has about 4M nodes, 650,000 unique relationships, and roughly 10M-15M properties (not as large, but it should provide an idea). It takes me less than 10 minutes to load the nodes file and set multiple labels, and then load the relationships file and set the relationships (all via LOAD CSV). This is also being done on a souped-up computer, but if yours is taking hours, I would make some tweaks.

My suggestions are as follows:

  • Are you intentionally returning the nodes after the MERGE? I can't imagine you are doing anything with them, but either way, consider removing the RETURN *. With RETURN *, you're returning all nodes, relationships, and paths found in the query, and that's bound to slow things down. (http://neo4j.com/docs/stable/query-return.html#return-return-all-elements)

  • Is the "num" field meant to be unique? If so, consider adding the following constraint (NOTE: this will also create the index, so there is no need to create it separately). I think this might speed up the MERGE (I'm not sure about that), though see the next point.

    CREATE CONSTRAINT ON (Numbers:Number) ASSERT Numbers.num IS UNIQUE;

  • If the num field is unique AND this is a brand new database (i.e. you're starting from scratch when you run this script), then call CREATE to create the nodes rather than MERGE (for the creation of the nodes only; see the revised statements sketched after this list).

  • As was already mentioned by Christophe, you should definitely increase the heap size to around 4g.
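
Putting those suggestions together, the two statements might look something like the following. This is only a sketch, assuming num is unique and the import starts from an empty database; the file paths are copied from the question.

    CREATE CONSTRAINT ON (n:Number) ASSERT n.num IS UNIQUE;

    // Nodes: CREATE instead of MERGE is safe only on an empty database,
    // and the RETURN * has been dropped.
    USING PERIODIC COMMIT 50000
    LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Numbers.csv"
    AS csvLine FIELDTERMINATOR ';'
    CREATE (:Number {num: csvLine.Numbers});

    // Relationships: MERGE on the nodes is kept in case Level1.csv
    // contains numbers that are not in Numbers.csv.
    USING PERIODIC COMMIT 50000
    LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
    AS csvLine FIELDTERMINATOR ';'
    MERGE (orig:Number {num: csvLine.OrigNum})
    MERGE (term:Number {num: csvLine.TermNum})
    MERGE (orig)-[:CALLS]->(term);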

Let us know how it goes!

EDIT 1: I have not been able to find much relevant information on memory/performance tuning for the Windows version. What I have found leaves me with a couple of questions, and is potentially outdated.

This is potentially outdated, but provides some background on some of the different settings and the differences between Windows and Linux.
http://blog.bruggen.com/2014/02/some-neo4j-import-tweaks-what-and-where.html

Those differences between Windows and Linux have themselves changed from one version to the next, as demonstrated in the following links:
Cypher MATCH query speed,
https://stackoverflow.com/a/29055966/4471711

Michael's response there seems to indicate that if you're NOT running a Java application with Neo4j, you don't need to worry about the heap (-Xmx). That doesn't seem right to me given the other information I saw, but perhaps all of that other information predates 2.2.

I have also been through this. http://neo4j.com/docs/stable/configuration.html

So, what I have done is set both the heap (-Xmx in neo4j.vmoptions) and the page cache to 32g.
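
For reference, the entries look something like this; the file and setting names vary between Neo4j versions and installations, so treat them as assumptions based on my setup rather than universal paths:

    # neo4j.vmoptions: JVM heap
    -Xmx32g

    # neo4j.properties: page cache (setting name as of Neo4j 2.2)
    dbms.pagecache.memory=32g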

1 vote

Can you modify your heap settings to 4096MB?

Also, in the second LOAD CSV, are the numbers used in the first two MERGE clauses already in the database? If so, use MATCH instead.

I would also commit at a level of 10000.
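
Combining both suggestions, the second LOAD CSV could look something like this (a sketch, assuming all of the OrigNum and TermNum values were already loaded by the first import):

    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM "file:c:/Users/hasham munir/Desktop/Numbers/CRTest2/Level1.csv"
    AS csvLine FIELDTERMINATOR ';'
    // MATCH is cheaper than MERGE here because the nodes already exist
    MATCH (orig:Number {num: csvLine.OrigNum})
    MATCH (term:Number {num: csvLine.TermNum})
    MERGE (orig)-[:CALLS]->(term);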