Redshift documentation identifies time-series tables as a best practice: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html
However, it doesn't address any of the following issues:
- how many tables within a union-all view is reasonable - hundreds? (unanswered)
- any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables? (Answer: no)
- most effective method of loading underlying tables? Perhaps using firehose to insert into a staging table then periodically inserting those rows into appropriate table within union-all view? (unanswered)
- any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria? (Answer: No)
- can redshift support dropping old tables, adding new tables and rebuilding union-all view within a transaction? (unanswered)
My situation:
- 100 million rows added daily, which will grow to 500 million in 3 years
- 12 month retention desired
- Estimated 99% of all queries will hit the most recent 1-7 days
- Data is written to existing table via kinesis firehose to s3 which then triggers a copy to redshift table.
My proposed solution:
- Create a year of daily tables with a union all view, along with a dist_key of sensor_id (100,000+ uniq values) and a sort_key of (timestamp, sensor_id).
- Have firehose load into staging table
- Create separate process that once an hour queries staging table to discover dates of data within table, then performs an insert into 'appropriate table' select * from where timestamp = table's timestamp.
- This hourly writer can probably wrap a table rename, multiple insert-selects, and table recreate in a transaction to be invisible to firehose.
- Once a month drop old tables, create next month of tables, and rebuild view.
- This union-all view maintenance can probably be wrapped in a transaction to avoid impacts to users.
- Once a night run the vacuum analyzer.
EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.