5 votes

I am developing an application that allows users to read books. I am using DynamoDB to store details of the books that users read, and I plan to use the data stored in DynamoDB to calculate statistics, such as trending books, authors, etc.

My current schema looks like this:

user_id | timestamp | book_id | author_id 

user_id is the partition key, and timestamp is the sort key.

The problem I am having is that with this schema I am only able to query the details of the books that a single user (partition key) has read. That covers one of my requirements.

The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.

I have looked into many other options and haven't figured out a way to create a schema that would allow me to run both queries:

  • Retrieve the records of the books read by a single user (can be done).
  • Retrieve the records of the books read by all users in the last x days (unable to do it).

I do not want to run a scan, since it will be expensive. I looked into the option of using a GSI on timestamp, but it requires me to specify a hash key, and therefore I cannot query all the records created between two dates.

2
Did you manage to find a good solution? I am having the exact same scenario. Did you find anything other than a scan? – AMBasra

2 Answers

3 votes

One naive solution would be to create a GSI with a constant hash key across all books and timestamp as a range key. This will allow you to perform this type of query.
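A minimal sketch of that query in Python with boto3, assuming a hypothetical GSI named `all-books-by-timestamp` whose partition key is a constant attribute `gsi_pk` (always the string "ALL") and whose sort key is the existing `timestamp` stored as epoch seconds:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("books_read")  # hypothetical table name

now = int(time.time())
seven_days_ago = now - 7 * 24 * 60 * 60

# Every item carries the same constant partition key, so one Query on the
# GSI covers all users, restricted to the last 7 days by the sort key.
# (Pagination via LastEvaluatedKey is omitted for brevity.)
resp = table.query(
    IndexName="all-books-by-timestamp",
    KeyConditionExpression=Key("gsi_pk").eq("ALL")
    & Key("timestamp").between(seven_days_ago, now),
)
items = resp["Items"]
```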

The problem with this approach is that it is likely to become a scaling bottleneck, as the same hash key maps to the same partition. One workaround for this problem is sharding: create a set of hash keys (e.g. from 1 to 10) and assign a random key from this set to every book. When you make a query, you will then need to make 10 queries and merge the results. You can even make the size of this set dynamic, so that it scales with your data.
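A sketch of the sharded variant under the same assumptions, with a hypothetical GSI `books-by-shard` keyed on a `shard` attribute (a random number from 1 to 10 written with each item) and `timestamp` as the range key; the 7-day query fans out over the shards and merges the results:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("books_read")  # hypothetical table name

NUM_SHARDS = 10
now = int(time.time())
seven_days_ago = now - 7 * 24 * 60 * 60

items = []
for shard in range(1, NUM_SHARDS + 1):
    # One query per shard; each shard holds roughly 1/NUM_SHARDS of the items.
    resp = table.query(
        IndexName="books-by-shard",
        KeyConditionExpression=Key("shard").eq(shard)
        & Key("timestamp").between(seven_days_ago, now),
    )
    items.extend(resp["Items"])

# Merge step: re-sort the combined results by timestamp if ordering matters.
items.sort(key=lambda item: item["timestamp"])
```

The shard count can also live in configuration, so it can be grown later as write volume increases.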

I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.

1 vote

One solution could be to use a GSI with two more columns: whenever you ingest a record, also store the date (e.g. 2017-07-02) as the GSI partition key and the time of day (e.g. 04:22:33.000) as its range key.
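As a sketch of what such a write could look like in Python with boto3 (the table and attribute names here are assumptions), each item stores the calendar date and the time of day as separate attributes so a GSI can key on them:

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("books_read")  # hypothetical table name

now = datetime.now(timezone.utc)

# The "date" attribute backs the GSI partition key and "time" the range key.
table.put_item(
    Item={
        "user_id": "user-123",
        "timestamp": now.isoformat(),              # original sort key
        "book_id": "book-456",
        "author_id": "author-789",
        "date": now.strftime("%Y-%m-%d"),          # e.g. "2017-07-02"
        "time": now.strftime("%H:%M:%S.%f")[:-3],  # e.g. "04:22:33.000"
    }
)
```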

Maintain a separate checkpoint table that contains the process name and the timestamp of the last read. Every time you read from the main table, update the checkpoint table so that subsequent reads are incremental. If you want the last 7 days of data, set the timestamp to 7 days in the past and fetch the data between that point and the current time.
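A minimal sketch of that checkpoint pattern, assuming a hypothetical `checkpoints` table keyed on `process_name`:

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
checkpoints = dynamodb.Table("checkpoints")  # hypothetical checkpoint table

PROCESS = "book-stats"

# Read where the last run stopped; a missing item means "no checkpoint yet".
resp = checkpoints.get_item(Key={"process_name": PROCESS})
last_read = resp.get("Item", {}).get("last_timestamp")

# ... query the book table for everything newer than last_read ...

# After a successful run, advance the checkpoint.
checkpoints.put_item(
    Item={
        "process_name": PROCESS,
        "last_timestamp": datetime.now(timezone.utc).isoformat(),
    }
)
```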

You can run such a query (for example with the Java SDK's QuerySpec) by passing the date as the partition key and using a BETWEEN condition on the time range key.
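In Python with boto3 (rather than the Java SDK's QuerySpec), a single-date query of that shape might look like this, reusing the hypothetical `date`/`time` GSI from above:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("books_read")  # hypothetical table name

# One partition (a single calendar date) with a range condition on the time.
resp = table.query(
    IndexName="date-time-index",  # hypothetical GSI name
    KeyConditionExpression=Key("date").eq("2017-07-02")
    & Key("time").between("04:00:00.000", "12:00:00.000"),
)
items = resp["Items"]
```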

Calculate the date difference between the checkpoint table's timestamp and the current date, and then fetch the data day by day.
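Putting it together, a sketch of the day-wise loop, walking from the checkpoint (or a fixed 7 days back) up to today and issuing one query per date partition:

```python
import boto3
from boto3.dynamodb.conditions import Key
from datetime import date, timedelta

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("books_read")  # hypothetical table name

start = date.today() - timedelta(days=7)  # or the date from the checkpoint table
end = date.today()

items = []
day = start
while day <= end:
    # One query per calendar-date partition of the hypothetical GSI.
    resp = table.query(
        IndexName="date-time-index",
        KeyConditionExpression=Key("date").eq(day.isoformat()),
    )
    items.extend(resp["Items"])
    day += timedelta(days=1)
```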