0 votes

Given a potentially large (up to 10^7) set of IDs (together with associated partition keys), I need to verify that there is no document in a Cosmos DB collection with an ID that is in the given set.

There are two obvious ways to achieve this (sketches of both follow below):

  1. Check the existence of each ID/partition key pair individually using parallel point reads, with AllowBulkExecution = true, and abort as soon as a read comes back successfully.

  2. Group the IDs by partition key and, for each group, issue parallel queries of the following form (each query kept below the maximum query size of 256 kB), aborting as soon as any query returns a non-empty result:

    SELECT TOP 1 c.id FROM c
        WHERE c.partitionkey = 'partition123' AND ARRAY_CONTAINS(['id1', 'id2', ...], c.id)
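
For concreteness, here is a minimal sketch of Solution 1 using the .NET SDK v3, assuming a Container instance from a client created with AllowBulkExecution = true; the method and variable names are placeholders, not anything prescribed by the SDK. It uses the stream variant of the point read so that a 404 does not throw, and cancels the remaining reads as soon as one ID is found:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    // Returns true as soon as any (id, partition key) pair exists in the container.
    // In practice you would throttle the degree of parallelism (e.g. with a
    // SemaphoreSlim) rather than starting 10^7 tasks at once; omitted for brevity.
    async Task<bool> AnyIdExistsAsync(
        Container container,
        IEnumerable<(string Id, string PartitionKey)> pairs)
    {
        using var cts = new CancellationTokenSource();

        var tasks = pairs.Select(async pair =>
        {
            // ReadItemStreamAsync reports 404 via the status code instead of throwing.
            using var response = await container.ReadItemStreamAsync(
                pair.Id, new PartitionKey(pair.PartitionKey), cancellationToken: cts.Token);

            if (response.StatusCode == HttpStatusCode.OK)
            {
                cts.Cancel(); // abort all remaining point reads
                return true;
            }
            return false; // 404: this ID does not exist in this partition
        }).ToList();

        try
        {
            var results = await Task.WhenAll(tasks);
            return results.Any(found => found);
        }
        catch (OperationCanceledException)
        {
            return true; // cancellation only ever happens after a hit
        }
    }

And a sketch of one batched query from Solution 2, parameterized so the ID chunk does not have to be spliced into the SQL text by hand (splitting the IDs per partition key into chunks of ~6500 is left to the caller):

    // Returns true if any ID in the chunk exists in the given partition.
    async Task<bool> AnyIdInChunkAsync(
        Container container, string partitionKey, IReadOnlyList<string> idChunk)
    {
        var query = new QueryDefinition(
                "SELECT TOP 1 VALUE c.id FROM c WHERE ARRAY_CONTAINS(@ids, c.id)")
            .WithParameter("@ids", idChunk);

        using var iterator = container.GetItemQueryIterator<string>(
            query,
            requestOptions: new QueryRequestOptions
            {
                // Scoping the query to one partition replaces the
                // c.partitionkey = '...' filter shown above.
                PartitionKey = new PartitionKey(partitionKey),
                MaxItemCount = 1
            });

        var page = await iterator.ReadNextAsync();
        return page.Count > 0; // any result means a duplicate ID exists
    }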

Is it possible to say, without trying it out, which one is faster?

Here is a bit more context:

  • The client is an Azure App Service located in the same region as the Cosmos DB instance.
  • The Cosmos DB collection contains about 10^7 documents and has a throughput of 4000 RU/s.
  • The IDs are actually GUID strings of length 36, so the number of IDs per query in Solution 2 is limited to about 6500 in order not to exceed the maximum query size (each ID contributes roughly 39 bytes once quotes and a separator are included, and 256,000 / 39 ≈ 6500). In other words, the number of required queries in Solution 2 is about n/6500, where n is the number of IDs in the set.
  • The number of different partition keys is small (< 10).
  • The average document size is about 500 B.
  • Default indexing policy.
  • A bit more background: the check is part of an import/initial-load operation. More precisely, it is part of the validation of an import set, so that an error can be returned before any write operations begin. The expected (non-error) case is therefore that none of the IDs in the set already exists. The import operation is not expected to run frequently (though certainly more than once), so managing auxiliary processes or data just to optimize this check would not be a good tradeoff.
This seems like a scenario suited to using the change feed to create a dedicated parallel form of item just for this query. It could in effect be your own index, perhaps using a prefix of the ID to allow efficient point queries in a bounded search space, e.g. see if your ID matches one of 2^16 prefixes, then if so (unlikely), check for the full ID. I'd aim to get as much of the work done in memory on the client first to reduce the need to hit Cosmos, and then only for the cheapest read. – Noah Stahl
@NoahStahl Interesting idea. This would work if the collection contained relatively few items. In the optimal case we would only need 2^16 point reads on the prefixes (true negatives). But in our case, where the collection contains many more than 2^16 items (on the order of 10^7), most of these prefixes will be taken, so we would get lots of false positives and revert back to 10^7 point reads on the full IDs. – Mo B.
I added some more details on the background. – Mo B.
If this is infrequent, my hunch would be to dump all the item IDs in bulk from Cosmos, then load them into a giant HashSet in memory to efficiently check for existence. Compared to the prospect of millions of network requests to Cosmos, it seems like this would win. – Noah Stahl
@NoahStahl That's a good idea, at least if the number of documents in the collection stays within the order of 10^7. The most efficient way of "dumping" would probably be the query SELECT VALUE c.id FROM c, right? I might benchmark these three methods and share the results here. – Mo B.
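
For reference, a minimal sketch of this dump-and-hash approach, again assuming a Container instance and the same using directives as in the earlier sketch (if the same ID may legitimately appear under different partition keys, store partition key/ID pairs instead of bare IDs):

    // Enumerates every document ID in the collection into an in-memory set.
    async Task<HashSet<string>> LoadAllIdsAsync(Container container)
    {
        var existingIds = new HashSet<string>(StringComparer.Ordinal);

        using var iterator = container.GetItemQueryIterator<string>(
            new QueryDefinition("SELECT VALUE c.id FROM c"),
            requestOptions: new QueryRequestOptions { MaxItemCount = -1 });

        while (iterator.HasMoreResults)
        {
            foreach (var id in await iterator.ReadNextAsync())
                existingIds.Add(id);
        }
        return existingIds;
    }

    // Usage: the whole import set is then validated in memory, e.g.
    // bool clash = importIds.Any(existingIds.Contains);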

1 Answer

1 vote

Not quite sure I understand the need for this, but... queries will cost more than point reads in terms of RU charge (and given your document size, each point read will cost 1 RU).

I don't see how you would abandon the remaining parallel point reads once one of them finds a particular ID within a given partition. Also remember that an ID is only unique within a partition, so the same ID can exist in multiple partitions.

It is likely more efficient to simply attempt to write each document and see whether the write succeeds (it will fail with a 409 Conflict if there is an ID collision within the partition).
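
As a rough sketch of that write-and-check approach (MyDoc and its property names are placeholders, and the using directives are the same as in the earlier sketch):

    // Cosmos DB rejects a create with 409 Conflict when the ID already
    // exists in the target partition.
    public record MyDoc(string id, string partitionkey);

    async Task<bool> TryInsertAsync(Container container, MyDoc doc)
    {
        try
        {
            await container.CreateItemAsync(doc, new PartitionKey(doc.partitionkey));
            return true;  // write succeeded, so the ID was free
        }
        catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.Conflict)
        {
            return false; // ID collision within this partition
        }
    }

Note that this interleaves the check with the writes, so it fits an insert-as-you-go flow rather than the validate-everything-before-writing flow described in the question.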

Lastly: For all practical purposes, you won't have a duplicate ID if you're generating a new GUID for every document you're saving.