0
votes

I have an ArangoDB with two collections, one for Documents, and one for Edges, I'd like an efficient query capable of only returning the Edges that share the same _to and _from values.

For example, assume my Documents Collection contains A, B, C, and D. Let's also assume my Edges Collection contains X, Y, and Z.

Edge X is _from A and _to B.

Edge Y is _from A and _to B as well.

Edge Z is _from C and _to D.

However, I do not know that X and Y are basically the same Edge, and I don't know that the Documents that share similar Edges are A & B. My query is to seek out these duplicate Edges.

What I have so far looks like this:

FOR ec1 IN edge_collection
    FOR ec2 IN edge_collection
        FILTER ec1._key != ec2._key AND ec1._to == ec2._to AND ec1._from == ec2._from
        RETURN ec1

This seems to work, though it also feels terribly inefficient. Is there a better way to go about doing this, or is this the best possible solution with AQL?

2

2 Answers

1
votes

You can group by _from and _to, count how many edges there are per group, and filter out the unique combinations:

FOR ec IN edge_collection
  COLLECT from = ec._from, to = ec._to WITH COUNT INTO count
  FILTER count > 1
  RETURN { from, to, count }

Or if you want to return the edge keys as well:

FOR ec IN edge_collection
  COLLECT from = ec._from, to = ec._to INTO edges = ec._key
  LET count = LENGTH(edges)
  FILTER count > 1
  RETURN { from, to, count, edges }

Alternatively using an aggregation instead of a post-calculation:

FOR ec IN edge_collection
  COLLECT from = ec._from, to = ec._to AGGREGATE count = LENGTH(1) INTO edges = ec._key
  FILTER count > 1
  RETURN { from, to, count, edges }

To return the full edges use INTO edges = ec instead. You could also use just INTO edges but then each edge will be nested in an object {"ec": … }.

1
votes

After some more digging, I have found a significantly faster methodology using the COLLECT statement.

Also, full disclosure, this was me building off of this other answer.

LET duplicates = (
    FOR ec IN edge_collection
        COLLECT from = ec._from, to = ec._to WITH COUNT INTO count
        FILTER count > 1
        RETURN {
            from: from,
            to: to,
            count: count
        }
)

FOR d IN duplicates
    FOR ec IN edge_collection
        FILTER d.from == ec._from AND d.to == ec._to
        RETURN ec

EDIT:

Building on @CodeManX's answer, my end goal was to be able to delete or rather REMOVE these duplicate values. If someone stumbles upon this with a similar goal, this might be helpful:

LET duplicates = (
    FOR ec IN edge_collection
        COLLECT from = ec._from, to = ec._to AGGREGATE count = LENGTH(1) INTO edges = ec._key
        FILTER count > 1
        RETURN { from, to, count, edges }
)

FOR d IN duplicates
    LET key_to_delete = FIRST(d.edges)
    REMOVE { _key: key_to_delete } IN edge_collection