8
votes

I'm building an ArangoDB edge collection that consists of many "types". By type, think animal species taxonomy.

I will be building a graph that connects all of these. Example: parent/child of ancient homo species: Homo habilis->Homo floresiensis->Homo erectus->Homo sapiens

Putting the different types in different collections would only be for superficial organizational reasons. There's a small possibility that it would be useful in the future for features I haven't thought of yet.

My specific question is: Does building graphs in ArangoDB that use multiple collections take a performance hit? Will using one large collection be more efficient for graphs?

Answering the first comment: If I break this out into different edge collections, it would be 4 collections with about 300,000 rows in each. A type can have multiple parents and children. The queries would be shortest path and any connectedness between types, if that makes sense. A "6 degrees of Kevin Bacon" type of thing.

EDIT: Please see the comments for some questions and answers. Almost every single query will span multiple types. Many queries will be 5-7 vertices deep. This project will almost exclusively be READING... I'm not worried about write speed at all.

EDIT 2: Will I be using a single instance or a distributed cluster? Honestly, either! Whatever will speed up reads. You tell me.

2
The answer will probably depend on the types of queries you will be running. Could you be more specific about that, and also tell us how many different types of edge collections you envision? You only gave one example (parent/child). It might also be helpful to know how many node collections you expect, and roughly how many nodes? – peak
Thanks. I updated my question with more details. – Chemdream
Will single queries typically span multiple edge collections? Could you give an example of a second edge collection, as well as an example of a query that DOES span multiple edge collections? – peak
Almost every single query would span multiple data collections but only a single edge collection. – Chemdream

2 Answers

5
votes

In the single server setup, using multiple collections does not have any penalty. Especially if your query does not span all edge collections, it will be faster to perform lookups on smaller collections.

How much faster or slower this will be depends on the storage engine (rocksdb / mmfiles). Given that you want maximum read performance, mmfiles will likely be faster.
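To illustrate restricting a lookup to only the edge collections a query needs, here is a minimal Python sketch that builds an AQL shortest-path query string. The helper name `shortest_path_query`, the collection names in the usage note, and the conservative name check are all assumptions made for illustration; they do not come from the question or from any ArangoDB driver.

```python
import re

# Conservative name check (an assumption for this sketch): a letter first,
# then letters, digits, underscores or dashes.
_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")


def shortest_path_query(edge_collections, direction="ANY"):
    """Build an AQL shortest-path query limited to the given edge collections.

    The start and target vertices stay as bind parameters (@start, @target),
    so only the collection names are interpolated into the string.
    """
    if direction not in ("OUTBOUND", "INBOUND", "ANY"):
        raise ValueError(f"unsupported direction: {direction}")
    if not edge_collections:
        raise ValueError("at least one edge collection is required")
    for name in edge_collections:
        if not _NAME.match(name):
            raise ValueError(f"suspicious collection name: {name!r}")
    return (
        f"FOR v, e IN {direction} SHORTEST_PATH @start TO @target "
        f"{', '.join(edge_collections)} "
        "RETURN v._key"
    )
```

Calling `shortest_path_query(["parent_child"])` yields a query that touches one edge collection only, while passing several names lets the same query follow edges from all of them. The resulting string would then be sent through whichever driver you use, with `@start` and `@target` supplied as bind variables.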

3
votes

I've got a taxonomy project in ArangoDB that seems roughly equivalent in terms of the data record count that you report.

This amount of data presents no performance challenges to ArangoDB. I've chosen to focus on modeling the relationships to best represent the dataset and have not regretted this.

In your example I'd probably have one collection for the species nodes, and start with a single 'begats' edge collection to capture the species evolution pathways.

If there are multiple schools of thought, multiple classifications, or other frameworks that describe alternate pathing between the species then I'd be looking at capturing each in a different edge collection.

For example, if one taxonomy pathing is arrived at by jaw shape, another always uses the pelvis, countryX has yet another method, and another is DNA-based, it could be instructive to dedicate an edge collection to each. You'd be creating alternative interconnect networks using exactly or mostly the same set of species nodes.
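The idea of several edge sets over one shared set of nodes can be sketched in plain Python, with no database involved. Everything here is hypothetical: the set names `by_dna` and `by_jaw` and the edges in `by_jaw` are placeholders invented for illustration (only the `by_dna` chain mirrors the example in the question), and the breadth-first search merely stands in for what a graph database's shortest-path query would do.

```python
from collections import deque

# Placeholder edge sets over the same species nodes (illustrative data only).
EDGE_SETS = {
    "by_dna": [
        ("habilis", "floresiensis"),
        ("floresiensis", "erectus"),
        ("erectus", "sapiens"),
    ],
    "by_jaw": [
        ("habilis", "erectus"),
        ("erectus", "sapiens"),
    ],
}


def shortest_path(start, goal, set_names):
    """Breadth-first shortest path using only the chosen edge sets.

    Edges are followed in both directions, like a 'connectedness' query.
    Returns the list of nodes on the path, or None if none exists.
    """
    neighbours = {}
    for name in set_names:
        for parent, child in EDGE_SETS[name]:
            neighbours.setdefault(parent, []).append(child)
            neighbours.setdefault(child, []).append(parent)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None
```

Choosing a different `set_names` list answers the same question under a different classification, which is the practical payoff of keeping each framework in its own edge collection.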

Species taxonomy isn't my field and the examples are probably nonsense. But I'd suggest not missing the opportunity to structure the data in the most useful way. The performance will very likely not be an issue.