Redshift Sort Keys For Joins

Question

I've read various posts and am still unclear. With a star schema, I would think that if I drive a query off a dimension table, say d_article, I end up with a set of SKs (sk_article) that are used to query/probe the main fact table. So, it makes sense to set sort keys on the fields commonly used in the Where clause on that dim table.

Next...and here's what I can't find an example or answer...should I include sk_article in a sort key in the fact table? More specifically, should I create an interleaved sort key with all the various SKs since we don't always use the same ones to join to the fact table?

I have seen no reference to including sort keys for use in Joins, only.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html

Amazon Redshift Foreign Keys - Sort or Interleaved Keys

Red Boy Red Boy · Accepted Answer · 2018-10-30T09:35:07

Redshift Sort Key

Sort keys are just for sorting purpose, not for joining purpose. There can be multiple columns defined as Sort Keys. Data stored in the table can be sorted using these columns. The query optimizer uses this sort ordered table while determining optimal query plans.

Also, as Tony commented,

Sort Keys are primarily meant to optimize the effectiveness of the Zone Maps (sort of like a BRIN index) and enabling range restricted scans. They aren't all that useful on most dimension tables because dimension tables are typically small. The only time a Sort Key can help with join performance is if you set everything up for a Merge Join - that usually only makes sense for large fact-to-fact table joins. Interleaved Keys are more of a special case sort key and do not help with any joins.

Every type of those keys has specific purpose. This may be good read for you.

For joining, fact and dimension tables, you should be using distribution key.

Redshift Distribution Keys (DIST Keys)

It determine where data is stored in Redshift. Clusters store data fundamentally across the compute nodes. Query performance suffers when a large amount of data is stored on a single node. Here is good read for you.

I hope this answers your question.

I good video session is here that may be really helpful in understanding SORT VS DIST Key.

Redshift Sort Keys For Joins

1 Answers