5 votes

When scanning or querying in DynamoDB, it is possible to set the Limit property. The DynamoDB documentation says the following:

The maximum number of items to evaluate (not necessarily the number of matching items).

So the problem with this is that if you set filters and such, it won't necessarily return all the matching items.

The goal I'm trying to achieve is to have a filter in a scan or query, but have it return x number of items, no matter what. I'm OK with having to use LastEvaluatedKey and make multiple requests, but I would like to make it as seamless and easy as possible (so avoiding that would be best).

The only way I have thought of to do this is to set the Limit property to, say, 1, then keep scanning or querying using the LastEvaluatedKey until I reach the x number of items I'm looking for (sketched below). The problem is that this seems VERY wasteful and inefficient. If you have a table of millions of records you might have to make thousands and thousands of requests. It doesn't seem like it scales very well. Of course, I'm sure it's no different from what DynamoDB does behind the scenes.
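Roughly what I have in mind, as a boto3 sketch (the table name and filter expression are just placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

def scan_n_items(n):
    """Naive approach: evaluate one item per request, page until n matches."""
    items = []
    kwargs = {
        "TableName": "MyTable",                      # placeholder table
        "FilterExpression": "active = :a",           # placeholder filter
        "ExpressionAttributeValues": {":a": {"BOOL": True}},
        "Limit": 1,                                  # one evaluated item per request
    }
    while len(items) < n:
        resp = dynamodb.scan(**kwargs)
        items.extend(resp.get("Items", []))
        if "LastEvaluatedKey" not in resp:           # table exhausted
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return items[:n]
```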

But is there a more efficient way to do this, where I can reduce the number of requests I have to make? Or is that the only way to achieve this?

How would you achieve this goal?

Why can't you just use the count in your response? – mewa

@mewa I thought that property is how many items DynamoDB returned in the response. That won't really help, because if I have a massive database I don't want to scan or query more than I have to. It'd be such a waste. – Charlie Fish

You won't be scanning more than you have to, since the queries are limited in size as well (1 MB). You'd then continue using the LastEvaluatedKey (and keep a running count while doing so). – mewa

@mewa OK, but couldn't you have a ton of records that are relatively small in size? In that case, not limiting it in the request would just chew through your capacity pretty quickly. – Charlie Fish

Then you'd have to set an appropriate limit. It depends on a number of factors. First of all, you have to bear in mind that read capacity is counted in multiples of 4 KB. This means that reading 4 bytes is going to be just as expensive as reading 4 KB. Knowing that, you should estimate how much space your data takes on average. You should also think about the distribution of your data. As with any NoSQL database, your solution has to be engineered specifically for your use case. – mewa

1 Answer

3 votes

A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.

You're 100% right that Limit is applied before FilterExpression, meaning Dynamo might return some number of documents less than the Limit while other documents that satisfy the FilterExpression still exist in the table but aren't returned.
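To make that concrete, here's a boto3 sketch (table, keys, and filter are all hypothetical) where a Query can legitimately return fewer than 10 items even though more matches exist further along the key range:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Dynamo evaluates at most Limit=10 items in key order, *then* filters.
# If only 3 of those 10 evaluated items pass the filter, you get 3 back,
# even if hundreds of later items would have matched.
resp = dynamodb.query(
    TableName="MyTable",                              # hypothetical table
    KeyConditionExpression="pk = :pk",
    FilterExpression="category = :cat",               # applied after the Limit
    ExpressionAttributeValues={
        ":pk": {"S": "user#123"},
        ":cat": {"S": "books"},
    },
    Limit=10,
)
print(len(resp["Items"]), resp.get("LastEvaluatedKey"))
```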

It sounds like it would be unacceptable for your API to behave in the same manner. That is going to mean that, in some cases, a single request to your service will result in multiple requests to Dynamo. Also, keep in mind that there is no way to predict what the LastEvaluatedKey will be, which would be required to parallelize these requests. So in the case that your service makes multiple requests to Dynamo, they will be serial. To me this is a rather heavy tradeoff, but if it is a requirement that you satisfy the Limit whenever possible, you have options.

First, Dynamo will automatically page at 1 MB. That means you could simply send your query to Dynamo without a Limit and implement the Limit on your end (sketched below). You may still need to make multiple requests to ensure that you've satisfied the Limit, but this approach will result in the fewest number of requests to Dynamo. The tradeoff here is the total data being read and transferred: chances are your Limit will not line up perfectly with the 1 MB boundary, which means the excess data being read, filtered, and transferred is wasted.
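A rough boto3 sketch of that approach, reusing the same hypothetical table; Dynamo pages at 1 MB on its own, and the caller stops once it has enough items:

```python
import boto3

dynamodb = boto3.client("dynamodb")

def query_with_client_side_limit(limit):
    """Send no Limit to Dynamo; enforce the caller's limit locally."""
    items = []
    kwargs = {
        "TableName": "MyTable",                       # hypothetical table
        "KeyConditionExpression": "pk = :pk",
        "FilterExpression": "category = :cat",
        "ExpressionAttributeValues": {
            ":pk": {"S": "user#123"},
            ":cat": {"S": "books"},
        },
    }
    while len(items) < limit:
        resp = dynamodb.query(**kwargs)               # reads up to 1 MB per call
        items.extend(resp.get("Items", []))
        if "LastEvaluatedKey" not in resp:            # no more pages
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return items[:limit]                              # excess is read but discarded
```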

You already mentioned the other extreme of sending a Limit of 1, and pointed out that this will result in the maximum number of requests to Dynamo.

Another approach along these lines is to create some sort of probabilistic function that takes the Limit given to your service by the client and computes a new Limit for Dynamo. For example, suppose your FilterExpression filters out about half of the documents in the table; then you can multiply the client Limit by 2 to get a reasonable Limit to send to Dynamo. Of the approaches we've talked about so far, this one has the highest potential for efficiency; however, it also has the highest potential for complexity. For example, you might find that a simple linear function is not good enough, and instead you need machine learning to find a multi-variate non-linear function to calculate the new Limit. This approach also depends heavily on the uniformity of your data in Dynamo as well as your access patterns. Again, you might need machine learning to optimize for those variables.
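A minimal sketch of the simple linear version, where the selectivity estimate (the fraction of evaluated items that survive the filter) and the safety factor are assumed, pre-measured constants:

```python
import math

# Assumed constants: measured offline, or updated from past responses.
ESTIMATED_SELECTIVITY = 0.5   # filter keeps ~half of the evaluated items
SAFETY_FACTOR = 1.2           # over-fetch slightly to cut follow-up requests

def dynamo_limit(client_limit):
    """Scale the client's limit by the expected filter pass rate."""
    return math.ceil(client_limit * SAFETY_FACTOR / ESTIMATED_SELECTIVITY)

# e.g. a client asking for 25 items -> request Limit=60 from Dynamo
print(dynamo_limit(25))
```

A natural refinement is to recompute ESTIMATED_SELECTIVITY from each response (items returned divided by items evaluated), so the function adapts as the data drifts.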

In any of the cases where you are implementing the Limit on your end, if you plan on sending the LastEvaluatedKey back to the client for subsequent calls to your service, you will also need to take care to compute it yourself from the last item you actually returned. You will no longer be able to rely on the LastEvaluatedKey returned from Dynamo.
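For a base table, the LastEvaluatedKey is simply the primary key attributes of the last item evaluated, so you can rebuild your own pagination token from the last item you handed to the client. A sketch, with the key attribute names assumed:

```python
def last_evaluated_key_from(item, key_attrs=("pk", "sk")):
    """Rebuild a pagination token from the last item returned to the client.

    key_attrs must match the table's partition/sort key names ("pk" and
    "sk" are assumed here); when querying a GSI, the index key attributes
    must be included as well.
    """
    return {attr: item[attr] for attr in key_attrs}

# After truncating Dynamo's results to the client's limit, hand back
# last_evaluated_key_from(items[-1]) as the cursor; Dynamo's own
# LastEvaluatedKey points past items the client never saw.
```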

The final approach would be to reorganize/regroup your data, either with a GSI, with a separate table that you keep in sync using DynamoDB Streams, or with a different schema altogether, with the goal of not requiring a FilterExpression at all.
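For instance, purely as an assumed illustration (the index name and key are hypothetical): if the attribute you were filtering on becomes the partition key of a GSI, the filter turns into a key condition, and Limit then counts matching items exactly.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical GSI "category-index" whose partition key is "category":
# the old filter condition is now a key condition, so Limit counts
# matching items exactly and no FilterExpression is needed.
resp = dynamodb.query(
    TableName="MyTable",
    IndexName="category-index",
    KeyConditionExpression="category = :cat",
    ExpressionAttributeValues={":cat": {"S": "books"}},
    Limit=25,  # returns up to 25 matches outright
)
```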