6
votes

I can't figure out how to do this in DynamoDB.

I have a table with data something like this:

ID    Updated     other fields...

1200  2017-12-11              ...
1201  2018-02-05              ...
1205  2018-01-05              ...
1206  2018-01-11              ...
1210  2018-02-15              ...
1212  2018-02-10              ...

The partition key is 'ID' and I have a sort key of 'Updated'.

I want to retrieve the records where Updated is greater than "2018-02-01", say.

I can't query on just 'Updated' alone; it complains with "Query condition missed key schema element: ID". I understand what that means, but I'm not sure how to do this properly.
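Roughly, what I'm attempting (shown here as a boto3 sketch; the table name is just a placeholder) is:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")

    # Fails: DynamoDB requires an equality condition on the partition key (ID)
    # before a range condition on the sort key can be applied.
    response = table.query(
        KeyConditionExpression=Key("Updated").gt("2018-02-01")
    )
    # -> ClientError: Query condition missed key schema element: ID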

I've tried adding various indexes and then querying on the index, including one with only the 'Updated' field as the partition key, but then I can't query for a range of values, only for an exact match on the partition key.

So, how do I query across multiple partitions for a condition?

I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?

I'm beginning to think that using a unique ID for a partition key is a bad idea. It would almost be better to partition the items that belong together, but then if two records had the same 'Updated' value, the two keys taken together would not be unique and one record would overwrite the other? I think I'm really confused about how to use the partition key properly. – Octopus
Using a unique ID for a partition key is generally a good idea; it helps scalability. Partition keys are not sorted, so you'll need a partition key other than Updated. – Costin
To ensure uniqueness of keys, consider appending a random value to Updated. That way your data ends up in the same position, but the keys are unique. :) – Costin

3 Answers

3
votes

Use a scan

Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include: how much data will I have, how will it grow over time, how fast do I need the scan to complete, and how many RCUs will this cost? Don't just dismiss scans: do the maths.
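For example, with boto3 in Python (table and attribute names here are just placeholders), a filtered scan that pages through the whole table might look like this:

    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource("dynamodb").Table("MyTable")
    filter_expr = Attr("Updated").gt("2018-02-01")

    # A scan reads the whole table; the filter is applied after the read,
    # so you still pay RCUs for every item examined.
    response = table.scan(FilterExpression=filter_expr)
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(
            FilterExpression=filter_expr,
            ExclusiveStartKey=response["LastEvaluatedKey"],
        )
        items.extend(response["Items"])

Note that the filter doesn't reduce what you pay for; it only reduces what comes back over the wire.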

Archive data

If you only need to access recent data, consider deleting or archiving old data. Removing it from the table shrinks the amount of data a scan has to read, which makes scans faster and cheaper.
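One built-in way to expire old items automatically is DynamoDB's TTL feature. It needs a numeric epoch-seconds attribute on each item; the table and attribute names below are just examples:

    import time
    import boto3

    client = boto3.client("dynamodb")

    # Tell DynamoDB which attribute holds the expiry timestamp (epoch seconds).
    client.update_time_to_live(
        TableName="MyTable",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "ExpiresAt"},
    )

    # Each item would then carry e.g. ExpiresAt = now + 90 days.
    expires_at = int(time.time()) + 90 * 24 * 3600

DynamoDB deletes expired items in the background, typically within a day or two of the expiry time.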

Partition by date

There are various strategies you can use to improve your table performance if you really want to use a query. For example, you could have a partition key of YYYY-MM and a sort key of datetime (down to the nanosecond). That way you can retrieve a whole month of data in one query, whilst still being able to restrict it to specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.
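As a sketch of what that looks like with boto3 (the table name and the YearMonth attribute are assumptions, not part of your current schema):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")

    # One partition per month, with the full timestamp as the sort key.
    response = table.query(
        KeyConditionExpression=(
            Key("YearMonth").eq("2018-02")
            & Key("Updated").gt("2018-02-01")
        )
    )
    items = response["Items"]

A date range that spans several months means issuing one query per month and merging the results in your application.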

2
votes

Nice problem, not so nice solution! :)

• You cannot do a query without an equality condition on the partition key.
• You need the Updated column to be a sort key, either in the table "schema" or in an index. If it is not a sort key there, you won't be able to efficiently query for Updated > VALUE.

So you need a constant partition key and Updated to be the sorting key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated

Of course, you'll lose some scalability because the whole index will live in a single partition, but using a KEYS_ONLY projection should give you enough room.
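Querying that GSI might look something like this in boto3 (the table name, index name, and constant value are assumptions):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")

    # Every item writes the same constant (e.g. "ALL") into ConstantColumn,
    # so one query against the GSI covers the whole table, sorted by Updated.
    response = table.query(
        IndexName="ConstantColumn-Updated-index",
        KeyConditionExpression=(
            Key("ConstantColumn").eq("ALL") & Key("Updated").gt("2018-02-01")
        ),
    )
    items = response["Items"]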

Should you really need more scalability, consider having PK values like C0, C1, ..., Cn, issuing one query per partition key, and then merging the results (divide et impera).
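A minimal sketch of that write-sharding approach, assuming each item was assigned one of a fixed number of shard keys at write time (names are illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")
    NUM_SHARDS = 8  # items were written with ShardKey = C0 .. C7

    items = []
    for n in range(NUM_SHARDS):
        resp = table.query(
            IndexName="ShardKey-Updated-index",
            KeyConditionExpression=(
                Key("ShardKey").eq(f"C{n}") & Key("Updated").gt("2018-02-01")
            ),
        )
        items.extend(resp["Items"])

    # Results arrive sorted per shard; re-sort across shards if needed.
    items.sort(key=lambda item: item["Updated"])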

1
vote

I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?

Your query code will be more complex to write, as you might have to issue multiple queries covering more than one partition to fill your result page.

But as you pointed out, this is cheaper than performing a full table scan.
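For instance, with a hypothetical GSI whose partition key is Year and whose sort key is Updated, covering a range that crosses years means one query per year (boto3 sketch; all names are illustrative):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")

    items = []
    for year in ("2018", "2019"):  # the years your date range touches
        resp = table.query(
            IndexName="Year-Updated-index",
            KeyConditionExpression=(
                Key("Year").eq(year) & Key("Updated").gt("2018-02-01")
            ),
        )
        items.extend(resp["Items"])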