Given a potentially large (up to 10^7) set of IDs (together with associated partition keys), I need to verify that there is no document in a Cosmos DB collection with an ID that is in the given set.
There are two obvious ways to achieve this:
Check the existence for each ID/partition key pair individually using parallel point reads, with
AllowBulkExecution = true
, and abort as soon as a read comes back successfully.Group the IDs by partition key, and for each group, issue parallel queries of the following form (such that each query is smaller than the maximum query size 256 kB), and abort as soon as any query returns with a non-empty result:
SELECT c.id FROM c
WHERE c.partitionkey = 'partition123' AND ARRAY_CONTAINS(['id1', 'id2', ...], c.id)
LIMIT 1
Is it possible to say, without trying it out, which one is faster?
Here is a bit more context:
- The client is an Azure App Service located in the same region as the Cosmos DB instance.
- The Cosmos DB collection contains about 10^7 documents and has a throughput of 4000 RU/s.
- The IDs are actually GUID strings of length 36, so the number of IDs per query in Solution 2 would be limited to about 6500 in order to not exceed the maximum query size. In other words, the number of required queries in Solution 2 is about
n/6500
wheren
is the number of IDs in the set. - The number of different partition keys is small (< 10).
- The average document size is about 500 B.
- Default indexing policy.
- A bit more background: The check is part of an import/initial load operation. More precisely, it is part of the validation of an import set so an error can be returned before the write operations begin. So the expected (non-error) case is that none of the IDs in the set already exists. The import operation is not expected to be executed frequently (though certainly more than once), so managing auxiliary processes/data just to optimize for this check would not be a good tradeoff.
SELECT VALUE c.id FROM c
, right? I might do a benchmark with these three methods and share the results here. – Mo B.