1 vote

I am seeking some clarity around the primary key choices for a DynamoDB table that my company has already implemented. The table has the following attributes:

    WeekStartingOn   STRING (represents the Monday of the week the transaction posted)
    TransactionID    STRING (UUID - unique)
    VendorID         STRING (UUID - unique by vendor)
    dccAmount        NUMBER 
    pointOfSaleTime  STRING (Storing UNIX timestamp)
    TerminalID       NUMBER (UUID)

The table's primary key definition:

WeekStartingOn       PRIMARY PARTITION KEY
TransactionID        PRIMARY SORT KEY

Current GSIs: vendorIDIndex

VendorID             PARTITION KEY
pointOfSaleTime      SORT KEY

Sample data:

(DynamoDB console screenshot omitted)

The main query type is:

For a vendor, show all the transactions in the past day, week, month, year, etc.
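With the existing vendorIDIndex GSI, that query can be sketched with boto3. The table name `Transactions` is an assumption, and `pointOfSaleTime` is assumed to hold a fixed-width epoch-seconds string so that string comparison matches numeric order:

```python
import time

def past_range(days):
    """Return (start, end) epoch-second strings covering the past `days` days."""
    now = int(time.time())
    return str(now - days * 86400), str(now)

def vendor_transactions(vendor_id, days):
    """Fetch one vendor's transactions in a trailing time window via the GSI.

    Because pointOfSaleTime is a STRING attribute, BETWEEN compares
    lexicographically; 10-digit epoch-second strings sort correctly until 2286.
    """
    import boto3  # deferred so past_range can run without AWS credentials
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Transactions")  # assumed name
    start, end = past_range(days)
    resp = table.query(
        IndexName="vendorIDIndex",
        KeyConditionExpression=Key("VendorID").eq(vendor_id)
        & Key("pointOfSaleTime").between(start, end),
    )
    return resp["Items"]
```

A production version would also follow `LastEvaluatedKey` to paginate, since a large vendor's window can exceed the 1 MB query limit.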

I believe the thinking behind the current layout is to group all of the past week's transactions contiguously, then from there, select the vendor's transactions.
I'm sure that this design is not right. Using WeekStartingOn as the partition key will result in hot keys, since most vendors will want to look at, for example, everything since WeekStartingOn = 2016-12-05. Sorting by TransactionID also doesn't make much sense, since it is a random UUID. I would be more inclined to define the base table's primary key as per the vendorIDIndex, i.e.

VendorID         PARTITION KEY
pointOfSaleTime  SORT KEY

Even so, I still have a couple of problems with this design. Some of our vendors are much larger than others, which will make the distribution of reads/writes across partitions unbalanced. For example, VendorA may have 500000 daily transactions, but VendorB may have only 10. Also, I am not entirely convinced that the combination of VendorID and pointOfSaleTime is guaranteed to be unique (two transactions from the same vendor in the same second would collide).

Or, slightly more complicated, and requiring work from the devs, we could write-shard the partition key:

1 - Randomise the VendorID at write time by adding a suffix, e.g. -1
2 - At read time, query VendorID + suffix once per suffix, X times in total
3 - Merge the results
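The three steps above can be sketched as follows. The shard count and table name are assumptions, and the key layout assumes the base table has been re-keyed to VendorID/pointOfSaleTime as proposed:

```python
import random

NUM_SHARDS = 10  # assumed shard count; tune per vendor volume

def sharded_key(vendor_id):
    """Step 1: pick a random shard suffix at write time to spread one vendor's
    writes across NUM_SHARDS partitions."""
    return f"{vendor_id}-{random.randrange(NUM_SHARDS)}"

def all_shard_keys(vendor_id):
    """Every partition key a vendor's items could live under."""
    return [f"{vendor_id}-{i}" for i in range(NUM_SHARDS)]

def merged_query(vendor_id, start, end):
    """Steps 2 + 3: query every shard for the time window, then merge by time."""
    import boto3  # deferred so the shard helpers run without AWS credentials
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("Transactions")  # assumed name
    items = []
    for pk in all_shard_keys(vendor_id):
        resp = table.query(
            KeyConditionExpression=Key("VendorID").eq(pk)
            & Key("pointOfSaleTime").between(start, end),
        )
        items.extend(resp["Items"])
    return sorted(items, key=lambda it: it["pointOfSaleTime"])
```

The read cost is NUM_SHARDS queries per lookup, which is the trade-off for an even write distribution.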

I think I like the last option best, but it is the most difficult to implement considering where we currently are.

What would be the best design for this?

Many thanks


2 Answers

1 vote

I would stream updates from this table to AWS Elasticsearch Service, using a Lambda function to generate the aggregates you require.

Also, since most of your queries are time-boxed, it might be worth using the time-series tables design pattern: have a table for each month of data, and dial down the throughput on older tables as they go cold. You are limited to 256 tables per account per region, so perhaps keep a year's worth of data in DynamoDB and move the rest to colder storage (S3, for example). You would not lose the ability to query data older than a year even if you store it in S3, because you can now query S3 buckets with SQL using the AWS Athena service.
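A minimal sketch of that streaming step as a DynamoDB Streams Lambda handler, assuming an `index_document` helper that wraps your Elasticsearch client (e.g. a signed HTTP call to the Amazon ES endpoint):

```python
def flatten(image):
    """Convert a Streams attribute map (e.g. {"S": "abc"}, {"N": "42"}) into a
    plain dict by stripping the DynamoDB type tags."""
    return {name: next(iter(typed.values())) for name, typed in image.items()}

def index_document(doc):
    """Assumed helper: push `doc` into Elasticsearch, e.g.
    es.index(index="transactions", id=doc["TransactionID"], body=doc)."""
    raise NotImplementedError("wire up your Elasticsearch client here")

def handler(event, context):
    """Lambda entry point for the table's DynamoDB Stream: forward every new
    or modified item to Elasticsearch."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            index_document(flatten(record["dynamodb"]["NewImage"]))
```

The aggregates themselves would then be computed with Elasticsearch queries rather than inside DynamoDB.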

0 votes

In the end, I went with a partition (primary) key of TransactionID. This is globally unique and spreads writes evenly across partitions. Downsides of this:

1) LSIs are redundant; there is no point in a range key when the partition key is a unique UUID

2) We don't query on TransactionID directly without already knowing the VendorID, so to get a vendor's transactions from the base table we would have to scan every TransactionID

3) We need to create additional GSIs for the VendorID queries. However, we have narrow query criteria, so this is not a problem
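Point 3 can be sketched with boto3's `update_table`. The key schema mirrors the vendorIDIndex from the question, while the table name and provisioned throughput numbers are placeholders:

```python
# GSI spec mirroring the question's vendorIDIndex; the throughput numbers are
# placeholders -- size them for your real read/write volume.
VENDOR_INDEX_UPDATE = {
    "Create": {
        "IndexName": "vendorIDIndex",
        "KeySchema": [
            {"AttributeName": "VendorID", "KeyType": "HASH"},
            {"AttributeName": "pointOfSaleTime", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5,
        },
    }
}

def add_vendor_index(table_name="Transactions"):  # assumed table name
    """One-off migration: add the VendorID/pointOfSaleTime GSI to the
    TransactionID-keyed base table."""
    import boto3  # deferred so the spec above is inspectable without credentials
    boto3.client("dynamodb").update_table(
        TableName=table_name,
        AttributeDefinitions=[
            {"AttributeName": "VendorID", "AttributeType": "S"},
            {"AttributeName": "pointOfSaleTime", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[VENDOR_INDEX_UPDATE],
    )
```

Queries for a vendor's transactions then hit this index instead of scanning the base table.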