2
votes

In the interest of better understanding Amazon's DynamoDB, Lambda functions and IAM roles (I'll stick to DynamoDB in this question), I'm setting up a Linux device to listen for new DynamoDB items and audibly read out updates that are being added by other functions at a regular interval. My goal is to query or scan items, returning those items in ascending order since a specific timestamp (the last time the device checked).

Here's the item structure I'm using so far:

{
  "id": {
    "S": "1eb4520d44715b6daa5f9d907fe43aab" //md5sum of "time"
  },
  "message": {
    "S": "I'm creating the audible reporting log now."
  },
  "status": {
    "S": "working"
  },
  "time": {
    "S": "1452297505" //timestamp: should probably add milliseconds for sake of unique "id"
  }
}

"id" is the partition key. "time" is the sort key. Looking at this now, I'm guessing I should probably make "time" a number, not a string...

Query or scan? Query seems like the correct option for sorting, but it requires a specific partition key value in the query (at least in the AWS website query tool), so perhaps I'm adding those incorrectly. Scan loads all items, and I'm guessing that sorting is neither automatic nor an option (at least not in the AWS website query tool). I really only want to load items greater than a timestamp value, sorted.
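On the string-versus-number doubt about "time": a sort key of type S is compared character by character, while type N is compared numerically. The sketch below only imitates that difference locally with Array.prototype.sort; the sample values are made up for illustration, and nothing here talks to DynamoDB.

```javascript
// Hypothetical epoch-second values; the middle one has fewer digits.
const asStrings = ["1452297505", "999999999", "1452297600"];
const asNumbers = asStrings.map(Number);

// Default sort() compares as strings, which is roughly how a type-S sort key behaves.
const stringOrder = [...asStrings].sort();

// A numeric comparator stands in for a type-N sort key.
const numberOrder = [...asNumbers].sort((a, b) => a - b);

console.log(stringOrder); // the 9-digit value lands last, out of time order
console.log(numberOrder); // chronological
```

Strings only sort chronologically when every value has the same width, which is one reason a number (or a fixed-width format) is the safer choice for a timestamp sort key.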

Where am I off in my thinking? I appreciate the assistance in advance.

UPDATE

After further experimentation with AWS-CLI and DynamoDB, I ended up using a slightly different solution. Since this is a small scale "hello world" type of project, all update items are added to the same table with a single partition key, "SF Reporter", for now. This could scale if I decide to start monitoring additional "reporter"/service updates with separate queries and/or devices.

{
  "datetime": { //sort key
    "S": "2016-01-11T05:15:02"
  },
  "message": {
    "S": "It is all good."
  },
  "reporter": { //primary partition key
    "S": "SF Reporter"
  },
  "status": {
    "S": "ok"
  }
}
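For reference, a "datetime" string in the format shown above can be produced as in this sketch. The helper name toSortKey is my own, and it assumes all writers use UTC, which the item structure above does not actually state.

```javascript
// Hypothetical helper: format a Date as "YYYY-MM-DDTHH:MM:SS" in UTC.
function toSortKey(date) {
  // toISOString() yields e.g. "2016-01-11T05:15:02.000Z"; keep the first 19 characters.
  return date.toISOString().slice(0, 19);
}

const earlier = toSortKey(new Date(Date.UTC(2016, 0, 11, 5, 15, 2)));
const later = toSortKey(new Date(Date.UTC(2016, 0, 11, 14, 0, 0)));

// Fixed-width ISO 8601 strings compare in chronological order.
console.log(earlier, later, earlier < later);
```

Because every field is zero-padded, plain string comparison on these values matches time order, so a string sort key works here.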

The JSON query itself looks something like this (abbreviated node.js example):

var AWS = require("aws-sdk");
AWS.config.credentials = new AWS.SharedIniFileCredentials({ profile: 'default' });
AWS.config.update({"region": "us-west-2"});
var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
    TableName: "spoken_reports",
    KeyConditionExpression: "#reporter = :reporter and #datetime >= :datetime",
    ExpressionAttributeNames:{
        "#reporter": "reporter",
        "#datetime": "datetime"
    },
    ExpressionAttributeValues: {
        ":reporter":"SF Reporter",
        ":datetime":"2016-01-11T05:15:02"
    }
};

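// Define the callback before passing it to query(); a `var` assigned below the
// call would still be undefined at the moment query() runs.
function onUpdatesReceived(err, data) {
    if (err) {
        console.log(err, err.stack);
    } else {
        console.log(data);
    }
}

docClient.query(params, onUpdatesReceived);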

The query returns the latest updates ordered by the string timestamp sort key (ascending by default). This allows for some scaling, as I can have multiple devices checking the same table for the latest updates. I would create a scheduled query/function to clear out old updates once in a while to keep the table small.
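A device that polls repeatedly would need to remember the last timestamp it saw and might have to follow pagination if a result is split across pages. Here is a rough sketch of that, assuming a DocumentClient-style object whose query(params) returns something with .promise(), as in AWS SDK for JavaScript v2. The function names are hypothetical, and it uses a strict ">" instead of the ">=" above so that the last item seen is not read twice, which is my assumption about the desired behavior.

```javascript
// Build Query parameters for "all items from one reporter after a given timestamp".
function buildParams(reporter, lastSeen, startKey) {
  const params = {
    TableName: "spoken_reports",
    KeyConditionExpression: "#reporter = :reporter and #datetime > :lastSeen",
    ExpressionAttributeNames: { "#reporter": "reporter", "#datetime": "datetime" },
    ExpressionAttributeValues: { ":reporter": reporter, ":lastSeen": lastSeen },
    ScanIndexForward: true // ascending by sort key, which is also the default
  };
  if (startKey) {
    params.ExclusiveStartKey = startKey; // resume where the previous page stopped
  }
  return params;
}

// Keep querying until the response no longer carries a LastEvaluatedKey.
// `client` is assumed to behave like AWS.DynamoDB.DocumentClient.
async function fetchUpdatesSince(client, reporter, lastSeen) {
  const items = [];
  let startKey;
  do {
    const page = await client.query(buildParams(reporter, lastSeen, startKey)).promise();
    items.push(...page.Items);
    startKey = page.LastEvaluatedKey;
  } while (startKey);
  return items;
}
```

After each poll, the device would store the "datetime" of the last item returned and pass it as lastSeen next time.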

2
Learning more from Amazon's sample queries, I'm thinking all items in my log (at least to start, or for each device/function that is reporting a status) can have the same id. Perhaps the backup server can have its own id/"thread", while the website status checker can have its own id. No need for a unique partition number in this use case. If that's the case, respond with that answer and I'll give you the points. - Christopher Stevens
*can have the same partition id - Christopher Stevens

2 Answers

0
votes

If you stick with this table design, scanning the entire table is the only option you have, for the reasons you've mentioned: for querying, you need a partition key, which is something your devices have no way of knowing beforehand.

There is another solution that comes to my mind:

  • Let's say your current table is called T1. Create another table, T2, that has deviceID as the partition key and timestamp as the sort key.
  • Define an AWS Lambda function on T1's stream that, on any update, also pushes that row into T2, once per device.
  • Whenever one of your devices wakes up, it queries (not scans) T2 with its own device ID, processes all the rows, and deletes them.

In other words, T2 will always have all the rows that a given device is yet to process.
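The fan-out step could look roughly like the sketch below. It assumes T1's stream is configured to include the new item image, that the copied item already carries the timestamp attribute T2 uses as its sort key, and that the list of device IDs is known to the function. DEVICE_IDS and buildFanOutPuts are hypothetical names, and the actual write is left as a comment because it needs the AWS SDK and credentials.

```javascript
// Hypothetical list of devices that should each receive a copy of every row.
const DEVICE_IDS = ["device-a", "device-b"];

// Turn a DynamoDB Streams event into one PutItem request per record per device.
function buildFanOutPuts(event, deviceIds) {
  const puts = [];
  for (const record of event.Records) {
    if (record.eventName === "REMOVE") continue; // deletes carry no NewImage
    const image = record.dynamodb.NewImage; // already in attribute-value format
    for (const deviceID of deviceIds) {
      puts.push({
        TableName: "T2",
        Item: { ...image, deviceID: { S: deviceID } }
      });
    }
  }
  return puts;
}

// Possible handler wiring with the low-level client from AWS SDK v2:
// const dynamodb = new AWS.DynamoDB();
// exports.handler = async (event) => {
//   const puts = buildFanOutPuts(event, DEVICE_IDS);
//   await Promise.all(puts.map((p) => dynamodb.putItem(p).promise()));
// };
```

Each device then queries T2 with its own deviceID and deletes the rows it has processed.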

0
votes

Dead simple way:

You should set up a global secondary index with "isNew" as its primary/hash key and the timestamp as its range key.

On creation of an entry, set isNew to a UUID or similar value. Because the attribute is present, the table item is projected into the index.

When you need to check for data, scan the secondary index; it will contain only the items that are new. Then call updateItem on each item you have read, in the table itself, to delete its isNew attribute. The item will be removed from the secondary index, so it is not read again.
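A sketch of the two requests involved, using the table and key names from the question's update ("spoken_reports", "reporter", "datetime"). The index name "isNew-index" and both function names are hypothetical, and the parameters are written for a DocumentClient-style client from AWS SDK v2.

```javascript
// Parameters for scanning the sparse index; only items that still have isNew appear in it.
function buildIndexScan() {
  return { TableName: "spoken_reports", IndexName: "isNew-index" };
}

// Parameters for removing isNew from one item, which drops it out of the index.
function buildMarkRead(item) {
  return {
    TableName: "spoken_reports",
    Key: { reporter: item.reporter, datetime: item.datetime },
    UpdateExpression: "REMOVE #isNew",
    ExpressionAttributeNames: { "#isNew": "isNew" }
  };
}

// Possible usage:
// const data = await docClient.scan(buildIndexScan()).promise();
// for (const item of data.Items) {
//   await docClient.update(buildMarkRead(item)).promise();
// }
```

A scan is not guaranteed to return items in timestamp order across different isNew values, so the device may still need to sort the results itself if order matters.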