3
votes

I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I would also like to separate the related documents into different collections (this decision is related to how the data are partitioned).

Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: The reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since a certain partition key is required to be present in ALL documents in the collection, it makes sense to separate out the documents to which the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards - but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one to gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.

Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?

Metadata document in collection "Metadata":
{
  "id": "metadata1",
  ...
}

Measurement document in collection "Measurements":
{
  "id": "measurement1",
  "metadata-id" : "../Metadata/metadata1",
  ...
}

Then, when I parse the data in my application/script I know what collection and document to query.
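
As a rough illustration of that parsing step, here is a minimal Python sketch. It assumes the reference always has the shape "../<collection>/<id>" shown above; the names DocumentRef and parse_ref are made up for this example and are not part of any Cosmos DB SDK.

```python
from typing import NamedTuple


class DocumentRef(NamedTuple):
    """Hypothetical holder for the two parts of a path-style reference."""
    collection: str
    doc_id: str


def parse_ref(ref: str) -> DocumentRef:
    """Split a reference such as "../Metadata/metadata1" into its parts."""
    parts = ref.split("/")
    # Expect exactly three segments: "..", the collection name, the document id.
    if len(parts) != 3 or parts[0] != ".." or not parts[1] or not parts[2]:
        raise ValueError(f"not a '../<collection>/<id>' reference: {ref!r}")
    return DocumentRef(collection=parts[1], doc_id=parts[2])


measurement = {"id": "measurement1", "metadata-id": "../Metadata/metadata1"}
ref = parse_ref(measurement["metadata-id"])
# ref.collection and ref.doc_id now say which collection and document to query.
```

Rejecting anything that does not match the expected shape keeps a malformed "weak link" from silently turning into a query against the wrong collection.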

Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or, is my use of relations spanning collections a code smell?

Thank you!

Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?

2
Can you elaborate on what about your partitioning makes you think another collection is necessary? I've been using Cosmos extensively for some time and have never found that to be the case. (Not the downvoter, by the way; it's a fair question.) Just curious about your reasoning. - Jesse Carter
@JesseCarter I updated my question by elaborating my reasoning for using separate collections. I'm curious how you are able to use a single partition key for heterogeneous (assuming) data while optimizing read/write speeds? - brudert
Please see the answer I've provided as to how to accomplish what you're looking for with a single collection. You're going down a dangerous and unnecessary path of thinking that you need one collection per type. This is not the case, as collections are generic stores and not entity-specific tables. Consider the cost difference when you start adding a third or a fourth entity type and have to pay for each new one you add. - Jesse Carter
Despite my simplified example, I don't plan on using a single collection per type. However, your comments do hit the mark since I was planning on using collections as logical groupings of different types. - brudert
I'm currently working on a huge graph implementation with Cosmos that will have hundreds of different entity types. I've had great success by using a generic partition key and having the individual types specify values for partitionKey that support their individual read/write patterns (sounds like you have that part figured out already, which is really good!). If you're smart about choosing the key you should be able to efficiently keep all of your documents together in the same collection as well. - Jesse Carter

2 Answers

3
votes

You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use the value of whatever you chose originally for your Measurements documents and set something different for your Metadata documents.
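
To make the idea concrete, here is a minimal Python sketch of a client-side helper that fills in a generic partitionKey before a document is written. Everything beyond the partitionKey name is an assumption for illustration: the type property, the deviceId field standing in for whatever key was originally chosen for measurements, and the choice of id as the key for metadata.

```python
def with_partition_key(doc: dict, doc_type: str) -> dict:
    """Return a copy of `doc` with generic `type` and `partitionKey` properties set."""
    if doc_type == "measurement":
        # Hypothetical: measurements keep using the originally chosen key value.
        key = doc["deviceId"]
    elif doc_type == "metadata":
        # Hypothetical: metadata documents are partitioned by their own id.
        key = doc["id"]
    else:
        raise ValueError(f"unknown document type: {doc_type!r}")
    # The key value is duplicated into partitionKey; the original field is kept.
    return {**doc, "type": doc_type, "partitionKey": key}


measurement = with_partition_key(
    {"id": "measurement1", "deviceId": "device42"}, "measurement"
)
metadata = with_partition_key({"id": "metadata1"}, "metadata")
```

Both documents can then live in one collection partitioned on /partitionKey, with each type deciding which value best supports its own read/write pattern.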

I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.

Please refer to this question that I answered regarding homogeneous vs heterogeneous in DocumentDB. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos, which necessitate having many different types of entities in a single collection and support exactly the use case you're describing, minus the extra collections. Obviously, when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key, which is why you need to go generic.

1
votes

What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind, though, that using this type of "relation" will be rather slow, because you need to fetch the documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
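
The two-step read described above can be sketched in Python as follows. The dict and list are in-memory stand-ins for the two collections (a real application would issue one database query per step), and the unit and value fields are invented for the example.

```python
# In-memory stand-ins for the "Metadata" and "Measurements" collections.
metadata_collection = {
    "metadata1": {"id": "metadata1", "unit": "celsius"},
}
measurements_collection = [
    {"id": "measurement1", "metadata-id": "metadata1", "value": 21.5},
    {"id": "measurement2", "metadata-id": "metadata1", "value": 22.0},
]


def load_with_metadata(measurements, metadata_by_id):
    """Pair each measurement with the metadata document it references."""
    # Step 1 has already happened: `measurements` is the first query's result.
    # Step 2: look up each distinct referenced metadata document once.
    wanted = {m["metadata-id"] for m in measurements}
    fetched = {mid: metadata_by_id[mid] for mid in wanted if mid in metadata_by_id}
    # A dangling reference yields None, since the database does not verify links.
    return [(m, fetched.get(m["metadata-id"])) for m in measurements]


pairs = load_with_metadata(measurements_collection, metadata_collection)
```

Collecting the distinct ids first means the second step costs one lookup per metadata document rather than one per measurement, which softens the extra round trip but does not remove it.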

Another possibility is to optimise your data for reading: you can embed the metadata document inside the measurement document. Your data will be duplicated, so if you update the metadata, you will have to update it in both collections, but you'll probably write less often than you read (if that's not the case, this setup would be worse).

Your documents would look like this:

Metadata document in collection "Metadata":
{
  "id": "metadata1",
  ...
}

Measurement document in collection "Measurements":
{
  "id": "measurement1",
  "metadata" : {
      "id": "metadata1",
      ...
  },
  ...
}
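
To show what the duplicated-write cost looks like, here is a small Python sketch using plain dicts and lists as stand-ins for the two collections. The helper names embed_metadata and update_metadata and the unit field are made up for this example; a real application would replace the loop with a query plus one write per affected measurement.

```python
import copy


def embed_metadata(measurement: dict, metadata: dict) -> dict:
    """Return a copy of the measurement with the metadata document embedded."""
    doc = dict(measurement)
    doc["metadata"] = copy.deepcopy(metadata)
    return doc


def update_metadata(metadata_docs: dict, measurement_docs: list, new_metadata: dict) -> int:
    """Update the canonical metadata and every embedded copy; return copies touched."""
    metadata_docs[new_metadata["id"]] = new_metadata
    touched = 0
    for doc in measurement_docs:
        if doc.get("metadata", {}).get("id") == new_metadata["id"]:
            doc["metadata"] = copy.deepcopy(new_metadata)
            touched += 1
    return touched


metadata_docs = {"metadata1": {"id": "metadata1", "unit": "celsius"}}
measurement_docs = [
    embed_metadata({"id": "measurement1"}, metadata_docs["metadata1"]),
    embed_metadata({"id": "measurement2"}, metadata_docs["metadata1"]),
]
touched = update_metadata(
    metadata_docs, measurement_docs, {"id": "metadata1", "unit": "kelvin"}
)
```

Reads need only the measurement document, but a single metadata change fans out to every measurement that embeds it, which is why this layout only pays off when reads clearly dominate writes.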