3
votes

I have a query that compares two collections and finds the "missing" documents from one side. Both collections (existing and temp) contain about 250K documents.

FOR existing IN ExistingCollection
    LET matches = (
        FOR temp IN TempCollection
            FILTER temp._key == existing._key
            RETURN true
    )
    FILTER LENGTH(matches) == 0
    RETURN existing

When this runs in a single-server environment (DB and Foxx are on the same server/container), this runs like lightning in under 0.5 seconds.

However, when I run this in a cluster (single DB, single Coordinator), even when the DB and Coord are on the same physical host (different containers), I have to add a LIMIT 1000 after the initial FOR existing ... to keep it from timing out! Still, this limited result returns in almost 7 seconds!

Looking at the Execution Plan, I see that there are several REMOTE and GATHER statements after the LET matches ... SubqueryNode. From what I can gather, the problem stems from the separation of the data storage and memory structure used to filter this data.

My question: can this type of operation be done efficiently on a cluster?

I need to detect obsolete (to be deleted) documents, but this is obviously not a feasible solution.

1

1 Answers

0
votes

Your query executes one subquery for each document in the existing collection. Each subquery will require many HTTP roundtrips for setup, the actual querying and shutdown.

You can avoid subqueries with the following query. It loads all document _key's into RAM - but that should be no problem with your rather small collections.

LET ExistingCollection = (FOR existing IN c2 RETURN existing._key)
LET TempCollection = (FOR temp IN c1 RETURN temp._key)
RETURN MINUS(ExistingCollection, TempCollection)