We would like to store a set of documents in Cosmos DB with a primary key of EventId
. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId
as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId
, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize)
. This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId
and int(EventId / PartitionSize)
, then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
Edit:
Clarification of a couple of points:
- Clients will access the events by specifying a list of
CustomerId
's, the lastEventId
they received, and a maximum number of records to retrieve. - For this reason, the use of
EventId
alone won't perform well, as it will result in a cross partition query (i.e.WHERE EventId > LastEventId
). - The system will probably be writing on the order of 1GB a day, in 15 minute increments.
- It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.