Imagine that you need to persist something that can be represented with the following schema:

{
  type: String
  createdDate: String (ISO-8601 date)
  userId: Number
  data: {
    reference: Number,
    ...
  }
}

type and createdDate are always defined/required; everything else, such as userId, data, and whatever fields are within data, is optional. The combination of type and createdDate does not guarantee uniqueness. The number of fields within data (when data exists) may differ.

Now imagine that you need to query against this structure like:

  1. Give me items where type is equal to something
  2. Give me items where userId is equal to something
  3. Give me items where type AND userId are equal to something
  4. Give me items where userId AND data.reference are equal to something
  5. Give me items where userId is equal to something, where type is IN a set of values, and where data.reference is equal to something

As it seems to me, a HashKey needs to be introduced at the table level to uniquely identify each item. The only choice I have is to use something like a UUID generator. With that key I can't run any of the queries described above against the table, so I need to create several global secondary indexes (GSIs) to cover all five cases, as follows:

  1. For the first use case I could create a GSI where type is the HashKey and createdDate is the RangeKey. What bothers me from the start, as I mentioned, is that there is a high chance this composite key will NOT be unique.
  2. For the second use case I could create a GSI where userId is the HashKey and createdDate is the RangeKey. Here the composite key can probably match an item uniquely.
  3. For the third use case I probably have two solutions. Either create a third GSI where type is the HashKey and userId is the RangeKey; with that approach I lose the ability to sort the returned data, and again the same worry applies: this composite key does not guarantee uniqueness. The other approach would be to use one of the two previous GSIs with a FilterExpression, right?
  4. For the fourth use case I have only one option: use the previous GSI with userId as HashKey and createdDate as RangeKey, and use a FilterExpression against data.reference. An index can't be created on fields from a nested object, right?
  5. For the fifth use case, because the IN operator is only supported via FilterExpression (right?), the only option again is to use the GSI with userId as HashKey and createdDate as RangeKey, and to use a FilterExpression for both type and data.reference?

So the only bright side I see is the GSI with userId as HashKey and createdDate as RangeKey. But again, userId is not a mandatory field; it can be NULL. A HashKey can't be NULL, right?

Most importantly, if a composite key (HashKey and RangeKey) can't guarantee uniqueness, does that mean that saving an item with a composite key that already exists in the index will silently overwrite the previous item, so that I lose data?

How frequently is your data updated? - Harshal Bulsara
@HarshalBulsara very often - Srle
I think you can look at the CloudSearch service, it might help. - Harshal Bulsara
Can you elaborate more please, what is CloudSearch? Why wouldn't you use the regular API for interacting with DynamoDB? - Srle
CloudSearch is another AWS service which can be integrated with DynamoDB for search purposes, since you want to query on your non-key attributes; plus, adding that many GSIs does not make any sense. - Harshal Bulsara

1 Answer

The thing about DynamoDB: it is a NoSQL database. On the plus side, it is easy to dump pretty much anything into it so long as you have a unique primary key, and it will be stored fairly efficiently for retrieval if you have a good partition key that subdivides your data into chunks. On the downside, any query against fields that are not part of a key (the table's primary key or a secondary index key) is a slow scan by definition. DynamoDB is not an SQL database and cannot give SQL-like performance when filtering on non-indexed attributes. For performance to be reasonable, you either need to narrow your query using index values that were set up before the query runs, or you need to know that the results you are looking for are confined to a few partition keys.

First let's consider the delimited partition keys route. Once you have narrowed down the partition keys as much as you can and there are no more indexes to use, everything else you ask DynamoDB is not really a query, but a scan over whatever the keys selected. You can ask DynamoDB to do it for you, but you may well be better off taking the full results from a partition key or index query and doing the filtering yourself in whatever language you are using. I use Java for this purpose because it is simple to query for the keys I need through the Java DynamoDB API and easy to then filter the results in Java. If this is interesting to you I can put together some simple examples.
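As a rough illustration of the filter-it-yourself approach, here is a minimal sketch in Python rather than Java. The filter_items helper and the sample items are made up for illustration; in practice the list would be whatever a query on a partition key or index returned.

```python
from typing import Any, Dict, Iterable, List, Optional, Set

Item = Dict[str, Any]


def filter_items(
    items: Iterable[Item],
    types: Optional[Set[str]] = None,
    reference: Optional[int] = None,
) -> List[Item]:
    """Keep items whose type is in `types` and whose data.reference equals `reference`.

    A criterion left as None is not applied.
    """
    result = []
    for item in items:
        if types is not None and item.get("type") not in types:
            continue
        if reference is not None:
            # data is optional, so treat a missing data object as empty
            data = item.get("data") or {}
            if data.get("reference") != reference:
                continue
        result.append(item)
    return result


# Hypothetical stand-in for the items returned by a query on userId = 42
queried = [
    {"type": "login", "createdDate": "2019-01-01T10:00:00Z", "userId": 42,
     "data": {"reference": 7}},
    {"type": "logout", "createdDate": "2019-01-01T11:00:00Z", "userId": 42},
    {"type": "purchase", "createdDate": "2019-01-02T09:30:00Z", "userId": 42,
     "data": {"reference": 7, "amount": 3}},
    {"type": "purchase", "createdDate": "2019-01-03T08:15:00Z", "userId": 42,
     "data": {"reference": 8}},
]

matches = filter_items(queried, types={"login", "purchase"}, reference=7)
```

The same loop covers the fourth and fifth queries from the question once the userId query has narrowed the data down.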

If you go the index and filter route, understand that the hash key is still the partition key for the index, which determines how well reads on the GSI can be spread out in parallel. The bigger your DynamoDB table becomes and the more time-sensitive your queries are, the bigger this issue will become.

So yes, you can make the queries you want with indexes, though it will take some careful planning of those indexes.

1. For the first use case I could create a GSI where type is the HashKey and
createdDate is the RangeKey. What bothers me from the start, as I mentioned,
is that there is a high chance this composite key will NOT be unique.

GSI keys do not have to be unique. You will receive multiple items on a query, but nothing will be broken from DynamoDB's perspective. However, if you use type as your partition key (HashKey), the performance of this query will likely be poor unless you have few records for each of your type values.
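To make the first index concrete, here is a sketch of what the index definition and the matching query request could look like, written as plain Python dictionaries in the shape of the low-level DynamoDB API (the kind of parameters a client such as boto3 accepts). The table name events and the index name type-createdDate-index are assumptions, not something from the question.

```python
def gsi_definition(index_name: str, hash_attr: str, range_attr: str) -> dict:
    """Describe a GSI with a hash (partition) key and a range (sort) key.

    The table's AttributeDefinitions must also declare both attributes, and a
    table in provisioned mode needs a ProvisionedThroughput entry here too.
    """
    return {
        "IndexName": index_name,
        "KeySchema": [
            {"AttributeName": hash_attr, "KeyType": "HASH"},
            {"AttributeName": range_attr, "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }


def query_by_type(table_name: str, index_name: str, type_value: str,
                  newest_first: bool = True) -> dict:
    """Build Query parameters for 'items where type is equal to something'."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # "#t" is a placeholder so the attribute name cannot clash with a reserved word
        "KeyConditionExpression": "#t = :t",
        "ExpressionAttributeNames": {"#t": "type"},
        "ExpressionAttributeValues": {":t": {"S": type_value}},
        # Results come back ordered by the range key (createdDate)
        "ScanIndexForward": not newest_first,
    }


# Hypothetical names, for illustration only
gsi = gsi_definition("type-createdDate-index", "type", "createdDate")
params = query_by_type("events", "type-createdDate-index", "login")
```

Every item sharing the same type and createdDate shows up as its own entry in the query result, since the index does not deduplicate on its key.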

2. For the second use case I could create a GSI where userId is the HashKey
and createdDate is the RangeKey. Here the composite key can probably match an
item uniquely.

No problems here. As with any GSI, the userId and createdDate pair does not have to be unique; if it is not, the query simply returns several items for that pair.

3. For the third use case I probably have two solutions. Either create a third
GSI where type is the HashKey and userId is the RangeKey; with that approach I
lose the ability to sort the returned data, and again the same worry applies:
this composite key does not guarantee uniqueness. The other approach would be
to use one of the two previous GSIs with a FilterExpression, right?

The RangeKey is your sort key, at least from DynamoDB's perspective. And yes, if you use a GSI and then filter, you are scanning the items that the GSI key condition selected. If you need to combine two GSIs, you either build a third index in advance or you filter/scan. DynamoDB has no provision for doing an INNER JOIN on two indexes. And having type as your partition key and then filtering the results has serious performance issues.

4. For the fourth use case I have only one option: use the previous GSI with
userId as HashKey and createdDate as RangeKey, and use a FilterExpression
against data.reference. An index can't be created on fields from a nested
object, right?

I am not sure about your nested object question, but yes, using the previous GSI with a filter/scan will work.

5. For the fifth use case, because the IN operator is only supported via
FilterExpression (right?), the only option again is to use the GSI with userId
as HashKey and createdDate as RangeKey, and to use a FilterExpression for both
type and data.reference?

Yes, if you want DynamoDB to do the work for you, this is the way to approach your fifth query. But I go back to my original statement: why do this? If you can create a GSI that efficiently gets you to the records you are interested in, use a GSI. But I never use filter expressions: I get the full partition, index or GSI results back from a query and do the filtering myself in my programming language of choice.
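For reference, if you do let DynamoDB apply the filter, the request for the fifth query might look roughly like the sketch below. It only builds the parameter dictionary in the shape of the low-level Query API; the table name events and the index name userId-createdDate-index are assumptions, and all attribute names go through placeholders in case any of them clash with reserved words.

```python
from typing import Sequence


def query_user_items(table_name: str, index_name: str, user_id: int,
                     types: Sequence[str], reference: int) -> dict:
    """Build Query parameters: userId as key condition, type IN (...) and
    data.reference as filter conditions."""
    if not types:
        raise ValueError("types must contain at least one value")

    names = {"#u": "userId", "#t": "type", "#d": "data", "#r": "reference"}
    values = {":u": {"N": str(user_id)}, ":ref": {"N": str(reference)}}

    # One value placeholder per allowed type: :t0, :t1, ...
    placeholders = []
    for i, type_value in enumerate(types):
        key = f":t{i}"
        values[key] = {"S": type_value}
        placeholders.append(key)

    return {
        "TableName": table_name,
        "IndexName": index_name,
        # Only the index key can go in the key condition
        "KeyConditionExpression": "#u = :u",
        # The filter runs after the items for this userId have been read
        "FilterExpression": f"#t IN ({', '.join(placeholders)}) AND #d.#r = :ref",
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


# Hypothetical names and values, for illustration only
request = query_user_items("events", "userId-createdDate-index", 42,
                           ["login", "purchase"], 7)
```

Because the filter is applied after the read, the query still reads every item for that userId; the filter only reduces what is sent back.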

If you need to do everything in DynamoDB your methods will work, but they may not be very fast, depending on how many rows are being filtered. I beat on the performance issue pretty hard because I have seen lots of work go into a database project, only for the whole thing to go unused because poor performance made it unusable.