I have a query that compares two collections and finds the documents that are "missing" from one side. Both collections (existing and temp) contain about 250K documents.
// return documents from ExistingCollection that have no matching _key in TempCollection
FOR existing IN ExistingCollection
  LET matches = (
    FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
  )
  FILTER LENGTH(matches) == 0
  RETURN existing
In a single-server environment (DB and Foxx on the same server/container), this runs like lightning, completing in under 0.5 seconds.
However, when I run this in a cluster (single DBServer, single Coordinator), even with the DBServer and Coordinator on the same physical host (in different containers), I have to add LIMIT 1000 after the initial FOR existing just to keep the query from timing out. Even then, the limited result takes almost 7 seconds to return.
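For reference, the limited version looks roughly like this (the added LIMIT is the only change from the query above):

FOR existing IN ExistingCollection
  LIMIT 1000  // added only to keep the cluster query from timing out
  LET matches = (
    FOR temp IN TempCollection
      FILTER temp._key == existing._key
      RETURN true
  )
  FILTER LENGTH(matches) == 0
  RETURN existing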
Looking at the execution plan, I see several REMOTE and GATHER nodes after the LET matches SubqueryNode. From what I can gather, the problem stems from the separation between where the data is stored and where it is filtered in memory.
My question: can this type of operation be done efficiently on a cluster? I need to detect obsolete (to-be-deleted) documents, but this is obviously not a feasible approach as it stands.