0
votes

We are processing a huge file by splitting the file to multiple parts of 200 rows each (storing them in a S3 bucket and processing each file). Each part file has an ID (Partition Key) and the Timestamp is the Sort Key.

I'm looking to find the total count (across multiple IDs/part files) by different statuses (SUCCESS, FAILURE). For eg:

200000 records were successful (Status=Success) within the last 4 hours.
200 records were failed (Status=Failure) due to errorStatus "FAILURE :: Could not open JDBC Connection" within the past 4 hours
158 records were failed (Status=Failure) due to errorStatus "FAILURE :: Network failed" within the past 4 hours.

I'm able to get them by each bID separately. For eg.

aws dynamodb query --table-name abc1 --index-name abcGdx1 --projection-expression "TId" --key-condition-expression "BId = :bId and STimestamp between :sortkeyval1 and :sortkeyval2" --filter-expression "PStatus = :status and PStage = :stage" --expression-attribute-values "{\":bId\": {\"S\": \"c1234-5678-1000\"}, \":stage\": {\"S\": \"C_C\"}, \":status\": {\"S\": \"SUCCESS\"}, \":sortkeyval1\": {\"S\": \"2020-09-22T22:00:42.108-04:00\"}, \":sortkeyval2\": {\"S\": \"2020-09-23T18:52:55.724-04:00\"}}" --return-consumed-capacity TOTAL

Can you please help with an idea on how this can be achieved

1

1 Answers

2
votes

It sounds like the status field is an attribute in your table, and not part of any primary key. If that's the case, you will not be able to use the query operation, since the query operation requires you to know the Primary Key of the item you're looking for (which sounds like your current solution).

You have one of two options:

  1. Perform a scan operation across your entire table for each status you care about. Unlike the query operation, scan lets you search the entire table. It's commonly considered an operation of last resort, as it is slow and expensive compared to query operations. If you were to go this route, the CLI command would look like this:
aws dynamodb scan 
  --table-name abc1 
  --filter-expression "#status < :status" 
  --expression-attribute-names '{"#status": "PStatus"}' 
  --expression-attribute-values '{":status": {"S": "SUCCESS"}}'

  1. Create a secondary index with the status field as your partition key. This will allow you to perform a fast query operation on all items with a given status.

For example, lets assume you have a table that looks something like this:

enter image description here

If you create a secondary index on the status field, your table would logically look like this:

enter image description here

Keep in mind that this is the same data as the first screenshot, just viewed from the perspective of the secondary index. Using this secondary index, you could issue a query operation to fetch all items with a given status:

aws dynamodb query 
  --table-name abc1 
  --index-name <YOUR STATUS INDEX NAME HERE> 
  --key-condition-expression "#pk = :pk" 
  --expression-attribute-names '{"#pk": "PStatus"}' 
  --expression-attribute-values '{":pk": {"S":"SUCCESS"}}'

The main difference between the two approaches in the scan vs query operation. A scan operation needs to look at your entire database to find what you are looking for, which is inefficient. The query operation looks up a specific primary key, which is much faster.