10
votes

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.

So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful, highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is well-nigh impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.

Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, figure out which server the shard in question sits on, and so forth. Possible, and, done right, quite scalable, but complex and painful.

So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple databases? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?

I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.

3
Not sure why you decided to color your question with subjective and negative commentary ("Table Storage - that cheap but awful alternative"). Azure Table Storage is a key/value NoSQL store, completely different from SQL Server (relational) or DocumentDB (document). - David Makogon

I agree the negative commentary isn't strictly necessary, but I've found ATS to be missing critical features found in other key-value store databases - see feedback.azure.com/forums/217298-storage, for instance. If a widely-promoted technology is all-but-unusable, it doesn't seem out-of-bounds to mention that. - Ken Smith

There are many who find it very usable. This is why it's best to leave this type of color commentary out of your question. - David Makogon

We can probably agree to disagree :-). (There are some limited scenarios where I've found ATS to be helpful.) In the meantime, since I see you're an Azure architect, any suggestions on how to go about structuring data in DocumentDB to ensure scalability? - Ken Smith

Azure Table Storage does not provide automatic infinite scale, either. The throughput limit (based on 1K size messages) for a single partition is only 2,000 messages per second. The aggregate limit for an entire storage account is only 20,000. Yes, it balances I/O, but automatic scale? No. - Pittsburgh DBA

3 Answers

13
votes

Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.

A single DocumentDB database can scale to a practically unlimited amount of document storage, partitioned by collections (in other words, you can scale out by adding more collections).

Each collection provides 10 GB of storage and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution, and it is the transaction domain for all the documents contained within it.
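To make "scale out by adding more collections" concrete, here is a minimal sketch of client-side partitioning in plain Python. It is not DocumentDB SDK code: the function names, the `coll-N` collection names, and the choice of a tenant ID as the shard key are all assumptions for illustration. It only shows how many 10 GB collections a given data size would need and how a key could be mapped to one of them with a stable hash.

```python
import hashlib
import math

COLLECTION_SIZE_GB = 10  # per-collection storage limit quoted above


def collections_needed(total_gb: float) -> int:
    """Smallest number of 10 GB collections that can hold total_gb of documents."""
    return max(1, math.ceil(total_gb / COLLECTION_SIZE_GB))


def pick_collection(shard_key: str, collection_names: list[str]) -> str:
    """Map a shard key (hypothetically, a tenant ID) to one collection.

    Uses a stable hash so the same key always lands in the same collection,
    regardless of process or machine.
    """
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collection_names)
    return collection_names[index]


# Hypothetical example: 35 GB of data needs 4 collections of 10 GB each.
names = [f"coll-{i}" for i in range(collections_needed(35))]
print(names)
print(pick_collection("tenant-42", names))
```

Note that plain modulo hashing remaps most keys when the number of collections changes, so a real design would probably need a lookup table or consistent hashing to grow without moving most of the data.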

Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/

Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.

3
votes

With the latest version of DocumentDB, things have changed. There is still the 10 GB limit per collection, but in the past it was up to you to figure out how to split your data across multiple collections to avoid hitting that limit.

Instead, you can now specify a partition key, and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition it on the date value in your JSON document, so that each day's entries go into a new partition.
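As a rough illustration of the log-data example, the sketch below derives a date-valued partition key for each log document and groups documents the way a per-day partitioning would. It is plain Python, not the DocumentDB SDK; the `timestamp` and `date` property names and both helper functions are assumptions. In the real service the partition key is, as far as I know, declared once on the collection (as a path such as `/date`), and the service does the routing.

```python
from collections import defaultdict
from datetime import datetime, timezone


def with_partition_key(log_entry: dict) -> dict:
    """Return a copy of the entry with a 'date' property (YYYY-MM-DD, UTC).

    Assumes 'timestamp' is an ISO 8601 string with an explicit UTC offset.
    """
    ts = datetime.fromisoformat(log_entry["timestamp"]).astimezone(timezone.utc)
    return {**log_entry, "date": ts.strftime("%Y-%m-%d")}


def group_by_partition(entries: list[dict]) -> dict:
    """Group documents by partition key value, one bucket per day."""
    partitions = defaultdict(list)
    for entry in entries:
        doc = with_partition_key(entry)
        partitions[doc["date"]].append(doc)
    return dict(partitions)


# Hypothetical log documents.
logs = [
    {"timestamp": "2016-04-01T08:15:00+00:00", "message": "started"},
    {"timestamp": "2016-04-01T23:59:00+00:00", "message": "still running"},
    {"timestamp": "2016-04-02T00:01:00+00:00", "message": "new day"},
]
for day, docs in sorted(group_by_partition(logs).items()):
    print(day, len(docs))
```

One thing to weigh with a date key: all of today's writes go to the same partition key value, so it may be worth checking whether that fits your write throughput before settling on it.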