0
votes

How do I choose a partition key in such a way that I can efficiently query all my documents for a given time period?

Background:

I'm building an analytics tool for a chat application using Azure Cosmos DB. I have a separate container to store incoming and outgoing messages. A typical Message document looks like this:

{
    "version": "v1",
    "partition_key": "user_id",
    "timestamp": "2020-01-30 14:02:32.402+00:00",
    "type": "incoming_message",
    "message": "hi there",
    "sender": "sender_id",
    "receiver": "receiver_id",
}

I have considered the following options as my partition key:

  1. User id: With this approach I can easily query all of a user's messages, but any time-based filtering becomes a cross-partition query and the RU cost will be high, especially with thousands of documents in the container.
  2. A date-specific value: According to this, I can use the date along with a random number as the partition key (Ex: 2018-08-09.1, 2018-08-09.2, and so on). But with this approach I will have to pass hundreds of partition key values into an IN clause in order to run a query over a large time interval (Ex: the last 6 months) - see the sketch after this list.
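
To make the trade-off in option 2 concrete, here is a minimal sketch of what a 6-month query would have to enumerate. It assumes the Python azure-cosmos SDK, 10 random buckets per day, and hypothetical account, database and container names - none of these come from the question itself:

from datetime import date, timedelta

from azure.cosmos import CosmosClient

BUCKETS_PER_DAY = 10  # assumed number of random suffixes per day


def bucket_keys(start, end):
    # Enumerate every date-bucket partition key value in [start, end].
    day = start
    while day <= end:
        for i in range(1, BUCKETS_PER_DAY + 1):
            yield "{}.{}".format(day.isoformat(), i)
        day += timedelta(days=1)


# A 6-month window is roughly 180 days * 10 buckets, i.e. ~1,800 key values.
keys = list(bucket_keys(date(2019, 8, 1), date(2020, 1, 31)))

# Hypothetical account, database and container names for illustration only.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("chat-analytics").get_container_client("messages")

# Every one of those values ends up as a parameter of one large IN clause
# (or as one query per key), which is what makes long time ranges expensive.
query = "SELECT * FROM c WHERE c.partition_key IN ({})".format(
    ", ".join("@pk{}".format(i) for i in range(len(keys))))
params = [{"name": "@pk{}".format(i), "value": k} for i, k in enumerate(keys)]
messages = list(container.query_items(
    query=query, parameters=params, enable_cross_partition_query=True))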

Do you have any recommendations on selecting a better partition key in order to support single partition queries for filtering documents by time?

1
Is timestamp the only filter in your query? Is there no other value you will use in this query? – Mark Brown
Hi @MarkBrown, I do have other queries (Ex: get all the messages by a particular user). For those queries, I'm storing a duplicate Message document in a different container partitioned by the user id. – OrangeTree

1 Answer

2
votes

Date is typically a poor choice for a partition key in a partitioned data store, for three reasons: efficiency, performance, and storage.

In a write-heavy workload the current date is always the hot partition. Because provisioned throughput is spread evenly across the physical partitions of a container, all of today's writes land on one partition while the throughput allocated to the others sits idle. That is inefficient, because you pay for RU/s you can never use, and it is not performant, because your write rate is capped at what a single partition can absorb. This might be acceptable for workloads that are small and will always stay small, but generally speaking you want to avoid a partition strategy that can never scale, and this one cannot.
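
A tiny back-of-the-envelope sketch of that cap (the partition count and RU figures below are made up for illustration, not taken from the question):

# Hypothetical numbers for illustration only.
provisioned_rus = 30_000        # RU/s provisioned on the container
physical_partitions = 30        # physical partitions backing the container

# RU/s are divided evenly across physical partitions, and with a date
# partition key all of today's writes target a single partition.
usable_write_rus = provisioned_rus / physical_partitions
print(usable_write_rus)         # 1000.0 -> only ~3% of what you pay for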

The second question to answer is how much data each logical partition would hold, because that determines the granularity you can use for a partition key value expressed as time. A logical partition is capped at 20 GB, so if you have 20 GB of data per day then using day, or anything coarser (week, month, year, etc.), would never work.
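
A quick way to reason about that granularity (the message size and daily volume below are assumptions for illustration, not figures from the question):

# Assumed figures: ~1 KB per message document, ~5 million messages per day.
avg_doc_bytes = 1024
messages_per_day = 5_000_000
logical_partition_limit_gb = 20          # Cosmos DB caps a logical partition at 20 GB

gb_per_day = avg_doc_bytes * messages_per_day / 1024 ** 3
print(round(gb_per_day, 1))              # ~4.8 GB/day
# At that rate a per-day key stays well under 20 GB, while a per-week
# key (~33 GB) would blow past the logical partition limit.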

The third question to answer is what aggregates you are looking to do and how much data you are trying to process per query. Cosmos DB is not an analytics data store. It is a row-based JSON store and functions best as a master data store and serving layer for computed batch or real-time views. From your question it sounds as though you are looking to do analytics, so you would probably benefit from ETL'ing this data into a column store, doing the aggregates there, and then writing the aggregates back to Cosmos DB and serving them from there. This article describes the lambda architecture I'm referring to. I'm not saying you have to use Spark as part of it, but if you're trying to do analytics and perform aggregations over large sets of data that span partitions, you need a batch layer and an analytics compute platform to do it.
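
For the serving-layer side of that architecture, here is a minimal sketch, assuming the Python azure-cosmos SDK and a hypothetical message_aggregates container partitioned by user (none of these names are prescribed by the answer), of writing pre-computed daily counts back to Cosmos DB:

from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account and database names for illustration only.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("chat-analytics")

# A separate container holding pre-computed aggregates, partitioned by user
# so "show me this user's daily counts" is a single-partition read.
aggregates = db.create_container_if_not_exists(
    id="message_aggregates",
    partition_key=PartitionKey(path="/user_id"),
)

# One document per user per day, produced by the batch layer (Spark or otherwise).
daily_count = {
    "id": "user_123|2020-01-30",   # deterministic id makes the upsert idempotent
    "user_id": "user_123",
    "date": "2020-01-30",
    "incoming_messages": 42,
    "outgoing_messages": 17,
}
aggregates.upsert_item(daily_count)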

Hope this is helpful.