2
votes

I'm using the neo4j-import command-line tool to load large CSV files into Neo4j. I've tested the command on a subset of the data and it works well. The full CSV data is about 200 GB, containing ~10M nodes and ~B relationships. Currently I'm using the default Neo4j configuration; it takes hours to create the nodes, and it got stuck at:

[*SORT:20.89 GB-------------------------------------------------------------------------------] 0

I'm worried that creating the relationships will take even longer, so I'd like to know possible ways to speed up the data import.

  1. It's a 16 GB machine, and the neo4j-import output shows:

     free machine memory: 166.94 MB
     Max heap memory : 3.48 GB

     Should I change the Neo4j configuration to increase memory? Will it help?

  2. I'm running neo4j-import with --processors=8. However, the CPU usage of the java process is only about 1%. Does that look right?

  3. Can someone give me a ballpark number for the loading time, given the size of my dataset? It's an 8-core, 16 GB standalone machine.

  4. Anything else I should look at to speed up the data import?


Updated:

  1. The machine does not have an SSD.

  2. I ran the top command, and it shows that 85% of RAM is used by the java process, which I believe belongs to the neo4j-import command.

  3. The import command is:

     neo4j-import --into /var/lib/neo4j/data/graph.db/ \
       --nodes:Post Posts_Header.csv,posts.csv \
       --nodes:User User_Header.csv,likes.csv \
       --relationships:LIKES Likes_Header.csv,likes.csv \
       --skip-duplicate-nodes true --bad-tolerance 100000000 --processors 8

  4. The header files are:

     Posts_Header: Post_ID:ID(Post),Message:string,Created_Time:string,Num_Of_Shares:int,e:IGNORE,f:IGNORE
     User_Header: a:IGNORE,User_Name:string,User_ID:ID(User)
     Likes_Header: :END_ID(Post),b:IGNORE,:START_ID(User)
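For reference, the three one-line header files described above can be written out as plain CSV files like this (a sketch; file names follow the import command in the question):

```shell
# Write the three one-line header files used by neo4j-import.
# Column definitions are copied verbatim from the question.
printf 'Post_ID:ID(Post),Message:string,Created_Time:string,Num_Of_Shares:int,e:IGNORE,f:IGNORE\n' > Posts_Header.csv
printf 'a:IGNORE,User_Name:string,User_ID:ID(User)\n' > User_Header.csv
printf ':END_ID(Post),b:IGNORE,:START_ID(User)\n' > Likes_Header.csv
```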

I ran the sample data import and it was pretty fast, several seconds. Since I'm using the default Neo4j heap setting and the default Java memory settings, will it help if I configure these numbers?
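A sketch of raising the importer's heap before re-running the command. The JAVA_OPTS variable name is an assumption; check your version's bin/neo4j-import wrapper script for the exact environment variable it reads:

```shell
# Assumption: this version's neo4j-import wrapper picks up JVM options
# from JAVA_OPTS; verify the variable name in bin/neo4j-import.
# Leave some of the 16 GB free for the OS page cache.
export JAVA_OPTS="-Xms8g -Xmx8g"
neo4j-import --into /var/lib/neo4j/data/graph.db/ \
  --nodes:Post Posts_Header.csv,posts.csv \
  --nodes:User User_Header.csv,likes.csv \
  --relationships:LIKES Likes_Header.csv,likes.csv \
  --skip-duplicate-nodes true --bad-tolerance 100000000 --processors 8
```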

Importing 10M nodes should take a couple of minutes at most. Can you give some examples of values in the :ID fields in your CSV files? Also, can you grab a thread dump when it gets to the point where it stops? Thanks in advance. – Mattias Finné
I've fixed the problem following the suggestion at stackoverflow.com/questions/33711258/… We had exactly the same issue, since we reuse the same CSV file with different headers. neo4j-import got stuck at the stage where it tries to sort and dedupe nodes. – Idealist
I suggest pointing out this issue in the doc at neo4j.com/developer/guide-import-csv/… You could note that if you reuse the same file for nodes and relationships with different headers, there will be duplicate IDs that the importer needs to dedupe, which can take time. – Idealist
What do you mean by "if you reuse the same file for nodes and relationships with different headers"? I'm facing the same issue, but I have the header in each file (not separate) and each file contains different labels, yet I'm stuck on the sorting part too. – dter

1 Answer

3
votes

Some questions:

  • What kind of disk do you have? (An SSD is preferable.)
  • It also seems all your RAM is already used up. Check with top or ps which other processes are using the memory, and kill them.
  • Can you share the full neo4j-import command?
  • What do a sample of your CSV and the header line look like?

It seems you have a lot of properties. Are they all properly quoted? Do you really need all of them in the graph?

Try with a sample first, e.g. head -100000 file.csv > file100k.csv

Usually it can import around 1M records per second with a fast disk; that count includes node, property, and relationship records.