I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j with the Neo4j import tool. The nodes finished importing in about an hour; since then, however, the importer has been stuck in the node index preparation phase (unless I am misreading the output below) for over 12 hours.
...
Available memory:
  Free machine memory: 184.49 GB
  Max heap memory : 26.52 GB

Nodes
[>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B
Done in 1h 7m 18s 54ms

Prepare node index
[*SORT:11.52 GB--------------------------------------------------------------------------------]881M
...
My question is how can I speed this up? I am thinking of the following:

1. Split up the import command for nodes and relationships, and do the nodes import first.
2. Create indexes on the nodes.
3. Do a merge/match to get rid of dupes (a rough sketch of steps 2-3 is below).
4. Do the rels import.
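For steps 2 and 3, I am planning something along these lines. This is only a sketch: the dedupe query is my own guess at how to collapse duplicates, and the neo4j-shell calls assume the client that ships in the same bin directory as the import tool.

# Step 2: index the id property so the dedupe lookups are not full label scans.
/var/lib/neo4j/bin/neo4j-shell -c "CREATE INDEX ON :Author(authorID);"

# Step 3: keep one node per authorID and delete the rest.
# (This would almost certainly need batching at ~1B nodes; shown unbatched for clarity.)
/var/lib/neo4j/bin/neo4j-shell -c "
MATCH (a:Author)
WITH a.authorID AS id, collect(a) AS dupes
WHERE size(dupes) > 1
FOREACH (n IN tail(dupes) | DETACH DELETE n);"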
Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?
Thanks.
UPDATE
I also tried importing exactly half the data on the same machine, and it gets stuck in the same phase after a proportionally similar amount of time. So I have mostly ruled out disk space and memory as the issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers) and they seem correct to me. Any suggestions on what else I should be looking at?
FURTHER UPDATE
Ok, so now it is getting kind of ridiculous. I reduced the data down to a single large file (about 3 GB). It contains only nodes of a single kind, and only their ids. So the data looks something like this:
1|Author
2|Author
3|Author
and the header (in a separate file) looks like this:
authorID:ID(Author)|:LABEL
And the import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here, but I really have no clue what. Here is the command line I use to invoke it:
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
--bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true \
--nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"
The bad-tolerance and skip-duplicate-nodes options are there mostly to see if I can get the import to go through at least once.
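In case it helps anyone reproduce this, a quick way to look for duplicate ids in the gzipped file (standard shell tools; paths as above) is:

# Print any id that appears more than once in the first |-delimited column.
zcat data/author/author_half_label.csv.gz | cut -d'|' -f1 | sort | uniq -d | head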