27
votes

Redshift allows designating multiple columns as SORTKEY columns, but most of the best-practices documentation is written as if there were only a single SORTKEY.

If I create a table with SORTKEY (COL1, COL2), does that mean that all columns are stored sorted by COL1, then COL2? Or maybe, since it is a columnar store, each column gets stored in a different order? I.e. COL1 in COL1 order, COL2 in COL2 order, and the other columns unordered?

My situation is that I have a table with (among others) a type_id and a timestamp column. Data arrives roughly in timestamp order. Most queries are joined against / restricted by both type_id and timestamp. Usually the type_id clauses are more specific, meaning a much larger percentage of rows can be excluded by looking at the type_id clause than by looking at the timestamp clause. type_id is the DISTKEY for this reason. I'm trying to understand the pros and cons of SORTKEY (type_id), SORTKEY (stamp), SORTKEY (type_id,stamp), SORTKEY (stamp,type_id).
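For concreteness, here is roughly what the table looks like (only type_id and stamp come from the actual schema; the other names are invented), with one of the four candidate sort keys filled in:

    -- Sketch of the table in question; SORTKEY (type_id, stamp) is one of
    -- the four options being weighed.
    CREATE TABLE events (
        event_id BIGINT,
        type_id  INTEGER,
        stamp    TIMESTAMP,
        payload  VARCHAR(256)
    )
    DISTKEY (type_id)
    COMPOUND SORTKEY (type_id, stamp);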

Thanks.

3
If you want your results sorted by more than one column (ORDER BY 1, 2, 3, ...), sort your data accordingly. – Guy

3 Answers

19
votes

If you declare SORTKEY (COL1, COL2), all columns are sorted by COL1, then COL2, as if an ORDER BY COL1, COL2 had been applied to the whole table.
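A minimal sketch to illustrate (hypothetical table; the key point is that a compound sort key orders whole rows, not each column independently):

    -- Rows are stored as if loaded in ORDER BY col1, col2, so every column of a
    -- row sits at the same position. col2 is NOT independently re-sorted.
    CREATE TABLE demo (
        col1 INT,
        col2 INT,
        note VARCHAR(20)
    )
    COMPOUND SORTKEY (col1, col2);

    -- Conceptual on-disk row order after load + VACUUM:
    --   (1, 1, 'a'), (1, 2, 'b'), (1, 9, 'c'), (2, 1, 'd'), (2, 5, 'e'), ...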

If you are using the SORTKEY to speed up a JOIN, AFAIU the exact choice doesn't matter much so long as you use the same SORTKEY on the tables that will be joined, because what happens then is a merge join.
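For example (table and column names are placeholders; this assumes both tables are distributed and sorted on the join column and have been vacuumed, which is roughly what Redshift wants before it will pick a merge join):

    -- Both tables share type_id as DISTKEY and leading SORTKEY, so the planner
    -- can choose a merge join (it should show up as "XN Merge Join" in the
    -- EXPLAIN output) instead of a hash join.
    CREATE TABLE fact_events (
        type_id INT,
        stamp   TIMESTAMP,
        value   DOUBLE PRECISION
    )
    DISTKEY (type_id)
    COMPOUND SORTKEY (type_id, stamp);

    CREATE TABLE dim_types (
        type_id INT,
        name    VARCHAR(64)
    )
    DISTKEY (type_id)
    SORTKEY (type_id);

    EXPLAIN
    SELECT d.name, COUNT(*)
    FROM fact_events f
    JOIN dim_types d ON d.type_id = f.type_id
    GROUP BY d.name;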

If COL1 is highly selective, like your type_id, it means only a small number of rows share the same type_id. Therefore, although you can add another column to the SORTKEY, its utility is limited because most of the row elimination has already happened.

If COL1 is not highly selective, like your stamp (which is a bit weird, by the way; I would have expected it to be more selective than type_id, but anyway), it means that filtering by stamp won't eliminate that many rows, so it makes more sense to declare a second sort key. However, this is less efficient than the other way around, since eliminating rows earlier is cheaper. If you sometimes filter by stamp but not by type_id, it may still make sense to do it this way.
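As a rough illustration of the trade-off with the question's columns (table name and literals are made up): with SORTKEY (type_id, stamp) the zone maps on type_id prune most blocks first and the stamp range only narrows down what is left, whereas with SORTKEY (stamp, type_id) the wide stamp range does the first, weaker cut.

    -- Typical query shape from the question: a selective type_id filter
    -- plus a timestamp range.
    SELECT COUNT(*)
    FROM events
    WHERE type_id = 42
      AND stamp >= '2015-01-01'
      AND stamp <  '2015-02-01';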

17
votes

We are also using Redshift, with about 2 billion records (plus about 20 million every day), and I have to say: the less selective a sort_key is, the earlier it should appear in the sort_key list.

In our case (and please do analyze how you use/query your own data) we used timestamp as the first sort_key. The problem with this is that even within one second we record about 200 rows, so our 1 MB blocks contain only a few seconds of data, and every type of data ends up in each block. Meaning, even though timestamp is highly selective, we cannot really filter further afterwards because we have all kinds of data in every block.

Recently we reversed the order of the sort_keys. The first one now has about 15 different values, the second has about 30, and so on, with timestamp last; one block still only spans a few seconds of it.
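A sketch of what that reversed ordering looks like as DDL (column names are invented; only the cardinalities match what is described above):

    CREATE TABLE measurements (
        category    SMALLINT,         -- ~15 distinct values, first sort key
        subcategory SMALLINT,         -- ~30 distinct values, second sort key
        stamp       TIMESTAMP,        -- ~200 rows per second, last sort key
        value       DOUBLE PRECISION
    )
    COMPOUND SORTKEY (category, subcategory, stamp);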

This results in the following (since we use the first two sort_keys as filters very frequently). Old solution: with a year of data, selecting a month drops 91% of the blocks, but after that it has to open all of the remaining ones, even though we want to filter further.

The new solution drops about 14/15 of the blocks in the first step, regardless of the date range, then about 95% of the remaining ones, and the timestamp filter still drops 91% of what remains.
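If you want to verify this on your own tables, one way (assuming the standard STL_SCAN system table; check the column names against the Redshift docs for your cluster) is to compare the rows read from disk with the rows that survived the scan filters:

    -- Run immediately after the query you are testing, in the same session.
    SELECT perm_table_name,
           SUM(rows_pre_filter) AS rows_read_from_disk,
           SUM(rows)            AS rows_after_scan_filters
    FROM stl_scan
    WHERE query = PG_LAST_QUERY_ID()
    GROUP BY perm_table_name;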

We tested it thoroughly with two 800-million-record tables that were identical except for the order of the sort keys. The longer the time period in the WHERE clause was, the better the results we got. It became even more significant in the case of joins, obviously.

So my suggestion is: know your database and what kinds of queries you run frequently, because the most selective column might not be the best first sort_key. Just as Enno Shioji said, it all depends on what you are filtering by.

3
votes

I would say the order of priority for sort_key columns should be:

  1. columns used in the dist key, filters and joins
  2. columns used in filters and joins
  3. columns used in filters only
  4. columns used in joins only
  5. columns used in group by / order by (including window functions)

The general rule: if columns are at the same level, put the lower-cardinality ones first.
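A rough sketch of how those rules might play out (all names hypothetical):

    CREATE TABLE orders (
        customer_id BIGINT,      -- dist key, also filtered and joined on -> first
        status      SMALLINT,    -- filtered on, low cardinality          -> second
        product_id  BIGINT,      -- joined on                             -> third
        order_date  DATE         -- used in GROUP BY / ORDER BY           -> last
    )
    DISTKEY (customer_id)
    COMPOUND SORTKEY (customer_id, status, product_id, order_date);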