4
votes

There are plenty of resources that recommend using high-cardinality attributes as partition keys. My question is, what will happen if I instead do the exact opposite of this and give all of my items the same partition key value (differentiating only by sort key), allowing me to query over the entire table?

Will this cause performance and/or hot partition issues? Do hot partitions even matter with adaptive capacity if they aren't reaching 3000 RCUs/1000 WCUs? Even then, what if queries are evenly distributed among my sort key?

Consensus seems to be that we shouldn't do this, but my question is: Why not?

3

3 Answers

1
votes

The recommendations and best practices are there to guide you to benefit the most from using DynamoDB. Typically, people use DynamoDB for storing massive and high-velocity data that suffers from scalability problems in the traditional RDBMS.

If you are talking about a small amount of data where the aggregated access velocity doesn't exceed 3000 RCUs/1000 WCUs, that's not enough for you to reach the pain point of using DynamoDB. In fact, you can probably achieve the same level of performance if you use a traditional RDBMS. However, as soon as your app becomes popular, or even if your app just encountered a spike over the time span of 5 minutes, the amount of data and velocity quickly increases, and you will feel the pain. That's why following best practices will usually give you this kind of future proof benefit.

Even then, what if queries are evenly distributed among my sort key?

DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB. [ref] So it's likely that you will still have the hot partition problem.

Don't get me wrong. There are use cases that require using the same partition key, such as modeling one-to-many and many-to-many relations of your data. These are valid use cases since data is relational by nature and that's the only way to efficiently model it in DynamoDB. However, if you choose to do the exact opposite of what the documentation suggests, your scalability is limited and you will not be able to take the full benefit from DynamoDB.

1
votes

Alright, here we go and I will do it with an example app.

Let's say you are creating a census application for Canada. Your partition key is going to be the province or territory name, of which there are 13 total iirc. You load the initial data in and all is fine. You open it up for users to come in. Everything is going ok, but hit the evening when everyone is home and just got a card saying they should go to your site. Well, where are the population centers in Canada? Ontario and Quebec have the most and they just so happen to be in the same table partition. Oops. Yes, adaptive capacity will try and save you, but in short order there are now 10s of thousands of people (or more) trying to use your site. That partition is now hot as it hits the 3000 IOPS per partition quota with just one section of Toronto online. DynamoDB is already trying to move items to other partitions and creating more to save you from your blunder, but your users are already getting throttled. You chose poorly. Twitter/reddit/etc is now erupting with nasty comments I shall not quote here. Meanwhile the partition that has Prince Edward Island and Yukon in it is not doing much at all. If you had picked a different partition key or used write sharding with the province/territory name, items would be more evenly spread and this would not be a problem.

That said, in another scenario, with a lightly used application and low cardinality PK, all might be well. As that app scales, that is when your blunder will become apparent. If it will never scale then it might be fine…by why hassle with that?

Hopefully you get the point. Also, this kind of thing is not unique to DynamoDB. I have worked with plenty of other databases that do partitioning where this could be a problem. At least DynamoDB is smart enough to try and save you from your blunder over time for you, but why set yourself up for problems?

1
votes

For a scalable app you just can't assume its IOPS is never hit. And because the traffic is never coming evenly from each region, some of the data centers may have much higher traffic than another. And during some special events a huge spike of traffic is expected (e.g., Alexa device accessing on Christmas), adaptive capacity takes effect with an uncertain delay for such case --- so you need to plan for a scale up in advance, and of course try to avoid potential hot partition issues at the very beginning.