0
votes

I am involved in a time series telemetry project, where we store data into Amazon Redshift. We have a timestamp column for collection time. And ClientID, IOt-ID indicating a unique IOT device within a client.

All our queries are time bound in the sense we query for a particular day/week/month. Would the following be a good dist/sort key ?

  • Distribution key - (Clientid, IOT-ID)
  • Sort key - timestamp
1

1 Answers

1
votes

The general rule for Amazon Redshift is:

  • Set the Distribution Key to the field normally used to JOIN with other tables. This will put all data for a given value of that column on the same slice, making it easier to JOIN with other tables that have the same DISTKEY.
  • Set the Sort Key to the field that is most commonly used in a WHERE statement. Rows will be stored in order of this field, making it easier to "skip over" disk blocks that do not contain the desired data. (This is very powerful.)

So, it sounds like your timestamp field is ideal as the SORTKEY.

The choice of DISTKEY depends on how you JOIN, but can also help GROUP BY since the relevant data is co-located.