1
votes

We have a Cosmos DB Collection with around 1 million documents containing user information. Not many additions or updates are done per day. However, we need very high throughput for reading.

Most of the queries will be based on UserId. The UserId property is a numeric value composed of a running number and a check digit.

Based on the official documentation, some could argue that both the full UserId and a substring of the UserId (say, the last 4 digits) could make a good partition key, i.e. either would:

  • Distribute requests and storage evenly
  • Allow queries to be routed "efficiently?" to the corresponding partition
  • Provide high cardinality

In the future, we might have more than one document per UserId, but let's assume no more than 5.

My understanding is that a balance between the number of partitions and the number of documents per partition is also desirable, so having 1 document per partition across 1 million partitions is not ideal either. However, on this SO thread, a Microsoft engineer suggests using the full unique identifier as the partition key. (It's worth noting that our case is slightly different, as here the UserId is a running number and not a random GUID.) In addition, the comments on this blog post also suggest using the ID as the partition key.

So, considering that: a) this collection will be mostly for read operations, b) we will have between 1 and 2 million UserIds, c) we won't have more than 5 docs per UserId, and d) we don't have a requirement for SPs or transactions across multiple users, which partition key would be more performant?

  1. The Full UserId
  2. A substring of the UserId (e.g. last 4 digits)
1
As I understand it, in most cases you will do a direct get by UserId without any high-RU queries. In that case I'd keep userId as the partition key. However, you will lose the ability to run stored procedures, triggers, etc. that change more than one user. - Olha Shumeliuk
Thanks @OlgaShumeliuk. We are exploring that as an option. We don't have any requirements for transactions across multiple userIds, so that's not an issue. So the question is: are there any performance implications of having very few documents per partition (mostly only one) and that many partitions? Is that a recommended practice? - Paco de la Cruz
I am guessing that your confusion stems from trying to map a partition key to a physical partition. You should not worry about how partition keys map to physical partitions. You should strive to have millions of partition keys, and Cosmos DB will distribute them intelligently across 5 partitions or 500 partitions; you don't need to worry about that. 1 or 5 documents per partition is perfectly OK, but as Olga says, an SP will deal with one partition at a time. Maybe this video will clear things up: youtube.com/watch?v=SS6WrQ-HJ30 - Rafat Sarosh

1 Answer

2
votes

Based on @RafatSarosh's comments and further research, I've learned that having millions of partitions with 1 document per partition is not a bad practice; we can rely on Cosmos DB's query execution optimisation.

We'll be using the userId as Partition Key.
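
To illustrate the point from the comments (many logical partition keys get spread over a small number of physical partitions), here is a rough simulation. It is only a sketch under stated assumptions: Cosmos DB's real hashing is internal, so MD5 is used purely as a stand-in, and the ID range, the count of 5 physical partitions, and the helper names `physical_partition` and `spread` are all hypothetical.

```python
import hashlib
from collections import Counter


def physical_partition(partition_key: str, physical_count: int) -> int:
    """Map a logical partition key to a physical partition index.

    Assumption: MD5 stands in for Cosmos DB's internal hash function.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % physical_count


def spread(keys, physical_count):
    """Count how many documents land on each physical partition."""
    counts = Counter(physical_partition(k, physical_count) for k in keys)
    return [counts.get(i, 0) for i in range(physical_count)]


# Hypothetical running-number user IDs, one document each
# (the check digit is left out for simplicity).
user_ids = [str(n) for n in range(1_000_000, 1_100_000)]

# Option 1: the full UserId as the partition key.
full_key = spread(user_ids, 5)

# Option 2: the last 4 digits of the UserId as the partition key.
last4_key = spread((uid[-4:] for uid in user_ids), 5)

print("full UserId  :", full_key)
print("last 4 digits:", last4_key)
```

In this toy model both options spread documents roughly evenly, so the simulation only shows that a very large number of logical keys is not a distribution problem. It says nothing about RU cost or query routing, which would need to be measured against the real service.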

HTH