2
votes

Based on this article, I have a question of strategy:

https://docs.microsoft.com/en-us/azure/cosmos-db/partition-data

A) Should I be structuring my partition keys so that my queries (ideally) end up at one partition? E.g. PartitionKey = CustomerId

OR

B) Does DocumentDB still handle queries that cross multiple (many) partitions efficiently? E.g. PartitionKey = "CustomerId+ContextName+TypeName"

We currently have "A" implemented, but have discussed "B" because the article contains this quote:

It is a best practice to have a partition key with many distinct values (100s-1000s at a minimum).

Emphasis on "at a minimum". Our CustomerIds will not be of sufficient volume to produce more than 200-300 partition keys. Should we add more information to the key ("B"), knowing that one query may hit 30-50 partitions (specifically because of the "TypeId" addition)?

SELECT * FROM c
WHERE (c.MyPartition = "1+ContextA+TypeA"
   OR c.MyPartition = "1+ContextA+TypeB"
   OR c.MyPartition = "1+ContextA+TypeC"
   ...)
   AND <some other conditions>
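
To make option "B" concrete, here is a minimal Python sketch of how the composite key values in the query above could be generated. The function names are hypothetical and only illustrate the "CustomerId+ContextName+TypeName" format; nothing here is part of the DocumentDB SDK.

```python
def make_partition_key(customer_id, context_name, type_name):
    """Join the three parts with '+', following the format in option B."""
    return f"{customer_id}+{context_name}+{type_name}"


def keys_for_context(customer_id, context_name, type_names):
    """Build one composite key per type name.

    Each distinct key value may be stored on a different partition,
    so a query listing all of them can touch several partitions.
    """
    return [make_partition_key(customer_id, context_name, t) for t in type_names]


keys = keys_for_context(1, "ContextA", ["TypeA", "TypeB", "TypeC"])
print(keys)
```

With 30-50 type names, this list is what the OR chain in the query would have to enumerate.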

The scenarios laid out in the article seem to presume that customer or user will generate plenty of keys. This isn't going to be true for us.

1
Please refer to the document to get more info about Azure DocumentDB. From the document, we can learn what data is stored in the same partition and how to choose the right partition key property. - Tom Sun - MSFT
@TomSun - thank you for the link. I have read that document. I can discriminate my data in multiple ways. It doesn't seem to answer the fundamental question: should my partitions be designed so that my queries target a single partition, or does querying across multiple partitions still perform well? - TBone

1 Answer

6
votes

The DocumentDB SDK makes parallel calls when you run a cross-partition query. If you check the network traffic, you will notice that it first queries the physical partition key ranges and then makes an individual call to each partition key range. It does this in parallel, and it lets you control settings such as MaxDegreeOfParallelism.
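
The fan-out described above can be pictured with a small Python simulation. This is only a sketch of the pattern, not the SDK's actual implementation: the in-memory `PARTITIONS` dictionary, `query_partition`, and `cross_partition_query` are all hypothetical stand-ins, and `max_degree_of_parallelism` simply caps the number of worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the physical partition key ranges.
PARTITIONS = {
    "range-0": [{"id": "a", "customerId": 1}, {"id": "b", "customerId": 2}],
    "range-1": [{"id": "c", "customerId": 1}],
    "range-2": [{"id": "d", "customerId": 3}],
}


def query_partition(range_id, predicate):
    """Simulate one request sent to a single partition key range."""
    return [doc for doc in PARTITIONS[range_id] if predicate(doc)]


def cross_partition_query(predicate, max_degree_of_parallelism=2):
    # Step 1: discover the partition key ranges.
    range_ids = sorted(PARTITIONS)
    # Step 2: issue one call per range, with a bounded number in flight.
    with ThreadPoolExecutor(max_workers=max_degree_of_parallelism) as pool:
        per_range = pool.map(lambda r: query_partition(r, predicate), range_ids)
        # Flatten the per-range results in range order.
        return [doc for docs in per_range for doc in docs]


results = cross_partition_query(lambda d: d["customerId"] == 1)
print([d["id"] for d in results])
```

The number of calls equals the number of ranges regardless of how many ranges actually hold matching documents, which is why the partition count matters for cross-partition queries.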

Having said that, there are two aspects to consider:

  • Volume of the data

If your volume is, say, 1 TB, that would require at least 100 physical partitions (each partition holding up to 10 GB), so a cross-partition query would make at least 100 calls. As your data volume grows, the increasing number of calls may start to hurt performance.
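
The arithmetic behind that estimate, as a short Python sketch. It assumes 1 TB is counted as 1000 GB and takes the 10 GB per-partition limit from the figure above; the function name is made up for illustration.

```python
def min_physical_partitions(total_gb, partition_limit_gb=10):
    """Smallest number of partitions that can hold total_gb (ceiling division)."""
    return -(-total_gb // partition_limit_gb)


# 1 TB, taken here as 1000 GB, at 10 GB per partition.
calls = min_physical_partitions(1000)
print(calls)
```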

  • Querying aggregations

If you are using aggregations (DocumentDB currently supports SUM/AVG/COUNT/MIN/MAX), note that these cannot be performed across partitions.
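
If an aggregate is needed over data spread across partitions, one possible workaround is to aggregate per partition and combine the partial results on the client. The Python sketch below assumes each partition query returns a hypothetical dictionary with `sum`, `count`, `min`, and `max`; it is not an SDK feature. Note that AVG has to be derived from the combined sum and count rather than by averaging the per-partition averages.

```python
def merge_partials(partials):
    """Combine per-partition aggregates into one overall result."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {
        "sum": total,
        "count": count,
        "min": min((p["min"] for p in partials), default=None),
        "max": max((p["max"] for p in partials), default=None),
        # Average comes from the global sum and count.
        "avg": total / count if count else None,
    }


# Hypothetical partial results from two partitions.
merged = merge_partials([
    {"sum": 10, "count": 2, "min": 3, "max": 7},
    {"sum": 5, "count": 1, "min": 5, "max": 5},
])
print(merged)
```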