4 votes

I have a large dataset (about 1B nodes and a few billion relationships) that I am trying to import into Neo4j using the Neo4j import tool. The nodes finished importing in an hour; since then, however, the importer has been stuck in a node index preparation phase (unless I am reading the output below incorrectly) for over 12 hours.

...
Available memory:
  Free machine memory: 184.49 GB
  Max heap memory: 26.52 GB

Nodes
[>:23.39 MB/s---|PROPERTIE|NODE:|LAB|*v:37.18 MB/s---------------------------------------------] 1B
Done in 1h 7m 18s 54ms
Prepare node index
[*SORT:11.52 GB--------------------------------------------------------------------------------] 881M
...

My question is: how can I speed this up? I am thinking of the following:

1. Split up the import command for nodes and relationships and do the nodes import (rough sketch below).
2. Create indexes on the nodes.
3. Do a merge/match to get rid of dupes.
4. Do the rels import.
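To illustrate step 1, this is roughly the kind of invocation I have in mind, with each kind of node and relationship coming from its own files (the file names and the WROTE type below are placeholders, not my real data):

/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
    --nodes:Author "data/author/author_header.csv,data/author/authors.csv.gz" \
    --relationships:WROTE "data/author/wrote_header.csv,data/author/wrote.csv.gz"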

Will this help? Is there something else I should try? Is the heap size too large (I think not, but would like an opinion)?

Thanks.

UPDATE
I also tried importing exactly half the data on the same machine, and it gets stuck in the same phase at roughly the same point (proportionally). So I have mostly ruled out disk space and memory as the issue.
I have also checked my headers (since I noticed that other people ran into this problem when they had incorrect headers), and they seem correct to me. Any suggestions on what else I should be looking at?

FURTHER UPDATE
OK, so now it is getting kind of ridiculous. I reduced my data down to just one large file (about 3 GB). It only contains nodes of a single kind, and it only has IDs. So the data looks something like this:

1|Author
2|Author
3|Author

and the header (in a separate file) looks like this

authorID:ID(Author)|:LABEL

And my import still gets stuck in the sort phase. I am pretty sure I am doing something wrong here, but I really have no clue what. Here is the command line I use to invoke it:
/var/lib/neo4j/bin/neo4j-import --into data/db/graph.db --id-type string --delimiter "|" \
    --bad-tolerance 1000000000 --skip-duplicate-nodes true --stacktrace true --ignore-empty-strings true \
    --nodes:Author "data/author/author_header_label.csv,data/author/author_half_label.csv.gz"


Most of the options, such as bad-tolerance and skip-duplicate-nodes, are there just to see if I can get the import to go through at least once.
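As a sanity check for duplicates (just standard Unix tools; the file is the same one as in the command above, and I am assuming the data file has no header row of its own since the header is in a separate file):

# print any authorID values that appear more than once in the node file
zcat data/author/author_half_label.csv.gz | cut -d'|' -f1 | sort | uniq -d | head

If this prints anything, then --skip-duplicate-nodes actually has work to do.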

Which version of Neo4j is this? – Mattias Finné
It is the latest, 2.3.1. – panache
I take that back, it is 2.2.5. – panache
Gah... it is 2.3 (Enterprise trial edition). The reason I got confused was that I checked the README file, which for some reason says 2.2.5, even though my installation was from neo4j.com/artifact.php?name=neo4j-enterprise-2.3.0-unix.tar.gz – panache
Can you see if there are many CPUs active at this point? Does it progress at all beyond 881M? Could you get a thread dump at this point and attach it here? – Mattias Finné

2 Answers

1 vote

I think I found the issue. I was using some of the tips from http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets, where it says you can re-use the same CSV file with different headers -- once for nodes and once for relationships. I underestimated how one-to-many my data was, which produced a lot of duplicate IDs, so that stage was spending almost all of its time trying to sort and then dedupe them. Re-working my queries to extract the data into separate node and relationship files fixed the problem. Thanks for looking into this!
So basically, always having separate files for each type of node and rel will give the fastest results (at least in my tests), as sketched below.
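For example, the split header files end up looking something like this (the Paper label and the shape of the relationship are just stand-ins for whatever the actual schema is). A node header:

authorID:ID(Author)|:LABEL

and a relationship header:

:START_ID(Author)|:END_ID(Paper)|:TYPE

That way each node ID appears only once in the node files, and the relationship rows are never fed in as nodes.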

0 votes

Have a look at the batch importer I wrote for a stress test:

https://github.com/graphaware/neo4j-stress-test

I used both the Neo4j index and an in-memory map between two commits. It is really fast and works for both versions of Neo4j.

Ignore the tests and get the batch importer.