0
votes

I have set the partition key of one of my Cosmos DBs to /partition.

For example: We have a Chat document that contains a list of Subscribers, then we have ChatMessages that contain a text, a reference to the author and some other properties. Both documents have a partition property that contains the type 'chat' and the chats id.

Chat example:

{
"id" : "955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
"name" : "SO questions",
"isChat" : true,
"partition" : "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
"subscribers" : [
    ...
]
}

We then have Message documents like this:

{
"id" : "4d1c7b8c-bf89-47e0-83e1-a8cf0d71ce5a",
"authorId" : "some guid",
"isMessage" : true,
"partition" : "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
"text" : "What should I do?"
}

It is now very convenient to return all messages for a specific chat, I just need to query all documents of the partition chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c with the property isMessage = true. All good...

But if I now want to query my db for a specific message by id, I usually just know the id, but not the partition and therefor have to run a slow crosspartition query. Which then led me to the question if I should not add the partitionKey to the message id so I can split the id when querying the db for a faster lookup. I saw that the _rid property of a document looks like a combination of the id of a db and the id of the collection and then a document specific id. What I mean by this is (simplified):

Chat.Id = "abc"
Chat.Partition = "chat_abc"  //[type]_[chatId]
Message.Id = "chat_abc|123" //[Chat.Partition]|[Message.Id]
Message.Partition = chat_abc //[Chat.Partition]

Lets assume that I now want to get the Message document by the id, I just split the id by the | symbol and then query the document with the 1st part of the id as partition and the full id as the key.

Does that make sense? Are there better ways to do this? Should I just always also pass the partitionKey of a document along, not just it's id? Should I just use the _rid properties instead?

Any experience is highly appreciated!

UPDATE

I have found the following answer here:

Some applications encode partition key as part of the ID, e.g. partition key would be customer ID, and ID = "customer_id.order_id", so you can extract the partition key from the ID value.

I have further asked the cosmos team by email if this is a recommended pattern and post an answer, in case I get any.

2

2 Answers

1
votes

Yes, your proposal to extract partition key from id (via a convention like a prefix/delimiter) makes sense. This is common among applications that have a single key and want to refactor it to use Cosmos DB from a different storage system.

If you're building your application from scratch, you should consider wiring the composite key (partition key + item key ("id")) through your API/application.

0
votes

First, if you know your data (and index) size) will remain within the 10gb limit and you RU/sec limit is ok, then a fixed partition-less collection will bypass this problem. Probably OP has knowlingly made the decision that partitioning is required, but it is an important consideration to note for generalization purposes. If possible, KISS ;)

If partitioning is a must, then AFAIK you cannot avoid crosspartition split and its overhead unless you know the partition key.

Imho the OP suggestion of merging the duplicated partition key into id field is a rather ugly solution, because:

  1. Name id implies it is unique key, partition key is not part of it or necessary for this key and its uniqueness. Anyone using this key upstream would incur the forced excess cost of longer key, blocked from using the simpler Guid type, etc.
  2. It will become a mess should your partitioning key change in future.
  3. The internal structure of merged id would not be intuitive without documentation - it's parts are not named and even if they look like to have a pattern new devs would not know for sure without finding external documentation to reliably understand what's going on.
  4. Your data model does not require this duplication on semantic level, it would be for your application querying comfort and hence such hacks should belong to your application code, not data model. Such leaking concerns should be avoided if possible.
  5. Data duplication within document would unnecessarily increase document size, bandwidth, etc. (may or may not be notable, depending on scale and usage). in-document duplication is necessary at times, but imho not necessarily in this case.

A better design would be to ensure the partition key is always present in logic context and could be passed along to lookups. If you don't have it available, then maybe you should refactor you application code (not data design) to explicitly pass around the chatId along with id where needed. That is WITHOUT merging them together into some opaque string format.

Also, I don't see a good way to use _rid for this as if I remember correctly, it did not contain any internal reference to a partition or partition key.

Disclaimer: I don't have any access or deep insight into internal CosmosDB index design or _rid logic on partitioned collections. I may have misunderstood how it works.