I am new to ArangoDB. I'm trying to import some of my data from Neo4j into ArangoDB: millions of nodes and edges storing playlist data for various people. I have the CSV files exported from Neo4j, and I ran a script that rewrites the node CSV files to have a _key attribute and the edge CSV files to have _from and _to attributes. When I tried this on a very small dataset, everything worked perfectly: I could see the graph in the UI and run queries against it. Bingo!
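In case the conversion matters, my script does roughly the following (the column names and collection names here are simplified placeholders, not my real ones):

    import csv

    # Nodes: Neo4j's id column becomes ArangoDB's _key.
    with open('nodes_neo4j.csv', newline='') as src, \
         open('nodes_arango.csv', 'w', newline='') as dst:
        reader = csv.DictReader(src)
        fields = ['_key'] + [f for f in reader.fieldnames if f != 'id']
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            row['_key'] = row.pop('id')
            writer.writerow(row)

    # Edges: _from/_to must be full document handles, i.e. collection/_key.
    with open('edges_neo4j.csv', newline='') as src, \
         open('edges_arango.csv', 'w', newline='') as dst:
        reader = csv.DictReader(src)
        fields = ['_from', '_to'] + [f for f in reader.fieldnames
                                     if f not in ('start_id', 'end_id')]
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            row['_from'] = 'people/' + row.pop('start_id')
            row['_to'] = 'songs/' + row.pop('end_id')
            writer.writerow(row)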
Now I am trying to import millions of rows (each arangoimp batch imports a CSV with about 100,000 rows). Each batch covers 5 collections, with a separate CSV file for each. After about 7-8 such batches, the system suddenly becomes very slow and unresponsive, and throws the following errors:
    ERROR error message: failed with error: corrupted collection

This one comes up seemingly at random for any batch, even though the format of the data is exactly the same as in the previous batches.

    ERROR Could not connect to endpoint 'tcp://127.0.0.1:8529', database: '_system', username: 'root'
    FATAL got error from server: HTTP 401 (Unauthorized)
When it doesn't throw these errors, it just keeps processing for hours with barely any progress.
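For completeness, each batch is imported with commands along these lines, one per collection (file and collection names are placeholders):

    arangoimp --file songs_batch7.csv --type csv --collection songs \
              --server.endpoint tcp://127.0.0.1:8529 --server.username root \
              --create-collection true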
I'm guessing all of this has to do with the sheer number of imports. One post I found suggested that I may be running out of file descriptors, but I'm not sure how to check for that or handle it.
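I assume checking would look something like this on Linux, though I haven't confirmed these are the right numbers to look at:

    ulimit -n                                      # descriptor limit for the current shell
    sudo ls /proc/$(pgrep -x arangod)/fd | wc -l   # descriptors the arangod process holds open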
Another thing I've noticed is that the biggest of the 5 collections is the one that gets these errors most often (although the others do too). Are file descriptors tied to a specific collection, even across separate import runs?
Could someone please point me in the right direction? I'm not sure how to begin debugging this.
Thank you in advance.