1
votes

I have a large bipartite graph (300M+ nodes) stored in ArangoDB using two collections and an edgelist. I'm trying to do an efficient traversal using AQL that starts from a node of one type with a particular label to find all other connected nodes of the same type with the same label. The resulting traversal could find anywhere between 2 and 150K nodes, though on average it will be around 10-20 nodes. It is important that a) I specify a large default max traversal depth (ie. 0..50) to ensure I find everything, but that b) AQL prunes paths so that most of the time it never reaches this max depth.

I have a query that gets the right results, but it does not appear to prune the paths, as it gets slower as I increase the max depth, even though the results do not change.

Here is the problem in miniature (picture here):

var cir = db._create("circles");
var dia = db._create("diamonds");
var owns = db._createEdgeCollection("owns");

var A = cir.save({_key: "A", color:'blue'});
var B = cir.save({_key: "B", color:'blue'});
var C = cir.save({_key: "C", color:'blue'});
var D = cir.save({_key: "D", color:'yellow'});
var E = cir.save({_key: "E", color:'yellow'});
var F = cir.save({_key: "F", color:'yellow'});
var G = cir.save({_key: "G", color:'red'});
var H = cir.save({_key: "H", color:'red'});

var d1 = dia.save({_key: "1"})_id;
var d2 = dia.save({_key: "2"})_id;
var d3 = dia.save({_key: "3"})_id;
var d4 = dia.save({_key: "4"})_id;
var d5 = dia.save({_key: "5"})_id;
var d6 = dia.save({_key: "6"})_id;

owns.save(A, d2, {});
owns.save(A, d5, {});
owns.save(A, d4, {});
owns.save(B, d4, {});
owns.save(C, d5, {});
owns.save(C, d6, {});
owns.save(D, d1, {});
owns.save(D, d2, {});
owns.save(E, d1, {});
owns.save(E, d3, {});
owns.save(F, d3, {});
owns.save(F, d4, {});
owns.save(G, d6, {});
owns.save(H, d6, {});
owns.save(H, d2, {});

Starting at the Node circle/A I want to find all connected vertices only stopping when I encounter a circle which is not blue.

The following AQL does what I want:

FOR v, e, p IN 0..5 ANY "circles/A" owns 
    FILTER p.vertices[* filter has(CURRENT, 'color')].color ALL == 'blue'
    return v._id

But the FILTER clause does not cause any pruning to occur. At least, as I said above, in the large database I have, increasing the max depth makes it very slow, without changing the results.

So how do I ensure that the filtering of the paths causes the algorithm to prune the paths? The docs are a little thin on this. I can only find examples where exact path lengths are used (p.vertices[1] for example).

1

1 Answers

0
votes

As far as I know, there is only one pattern the optimizer is currently capable of recognizing to prune paths instead of post-filtering, and that is a plain filter on the path variable in combination with the ALL operator.

The inline filter you added may prevent this optimization from being applied. I don't see why you added it in the first place. A vertex without color attribute has an implicit value of null, which is not equal to 'blue' and should thus be unnecessary.

Does this query produce the same results, but faster as you increase the traversal depth?

FOR v, e, p IN 0..5 ANY "circles/A" owns 
    FILTER p.vertices[*].color ALL == 'blue'
    return v._id

There is an open feature request for an explicit way to prune paths. Feel free to add your use case.