8
votes

Redshift documentation identifies time-series tables as a best practice: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html

However, it doesn't address any of the following issues:

  • how many tables within a union-all view is reasonable - hundreds? (unanswered)
  • any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables? (Answer: no)
  • most effective method of loading underlying tables? Perhaps using firehose to insert into a staging table then periodically inserting those rows into appropriate table within union-all view? (unanswered)
  • any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria? (Answer: No)
  • can redshift support dropping old tables, adding new tables and rebuilding union-all view within a transaction? (unanswered)

My situation:

  • 100 million rows added daily, which will grow to 500 million in 3 years
  • 12 month retention desired
  • Estimated 99% of all queries will hit the most recent 1-7 days
  • Data is written to existing table via kinesis firehose to s3 which then triggers a copy to redshift table.

My proposed solution:

  • Create a year of daily tables with a union all view, along with a dist_key of sensor_id (100,000+ uniq values) and a sort_key of (timestamp, sensor_id).
  • Have firehose load into staging table
  • Create separate process that once an hour queries staging table to discover dates of data within table, then performs an insert into 'appropriate table' select * from where timestamp = table's timestamp.
  • This hourly writer can probably wrap a table rename, multiple insert-selects, and table recreate in a transaction to be invisible to firehose.
  • Once a month drop old tables, create next month of tables, and rebuild view.
  • This union-all view maintenance can probably be wrapped in a transaction to avoid impacts to users.
  • Once a night run the vacuum analyzer.

EDITS: added notes identifying which issues have been answered, and added some detail to the proposed solution.

1
I'm currently battling something similar. Did you go by with it? What is your experience? Any insights on performance implications of doing UNION of e.g. 30 tables (month worth of data)?Milovan Zogovic
I did not implement time-series tables - and am in fact very disappointed in the functionality: Amazon's documentation is completely insufficient, it appears that you get zero partitioning benefits but it makes query planning more difficult, and direct & simple loads (say from firehose) impossible. The only benefits are to rolling off data and perhaps analyzing & vacuuming. My solution so far is ignore the problem for now and evaluate snowflake.KenFar
I feel you. They market it as petabyte-scale warehouse. I wonder how petabyte-scale vacuuming would work :) Snowflake seems interesting. Have you tried BigQuery? It has great support (at least in docs) for partitioning.Milovan Zogovic
I'm having the same problem and trying to figure out the best possible approach. It would be nice if you could share your experience.Neoster
I agree to KenFar's view that time-series table does not worth the effort if it's connecting to some steam input (e.g. firehose), handling the insertion would be difficult and error-prone.Fishball

1 Answers

4
votes

Your proposed process sounds quite good! While I can't answer all your questions, here is some information:

Any method of writing to the union-all view and having redshift direct those inserts to the correct underlying tables?

Views are read-only. It is not possible to write to a view, nor is it possible to insert data while expecting Redshift to send it to an appropriate table (eg a specific table for the given day).

Any way to enable redshift to eliminate some underlying partitions (tables) when querying the union-all view if their date range is outside of a query's criteria?

Redshift will not exclude specific tables from the query, but it will avoid reading particular disk blocks through the use of Zone Maps. Each block of data written to disk is associated with a specific table and column. The block has a Zone Map, which indicates the minimum and maximum values of that field stored within the block.

If a query includes a WHERE clause, Redshift can skip blocks that do not contain relevant data. This is particularly powerful when used on the SORTKEY column, since similar ranges of data are grouped together.

Given that you are using a date as the SORTKEY, Redshift will read very few disk blocks if the query includes a WHERE clause based on that column. This is very similar to the idea of skipping tables, but it actually skips reading disk blocks.