Sort & Dist keys selection for a dimension table in Redshift database

Question

In Redshift database, I want to decide a sort key for a dimension table between surrogatekey and natural primary key. The definition says "Sort keys should be selected based on the most commonly used columns when filtering, ordering or grouping the data". My Question is -

I have a Employee table with (Emp_key,Emp_Id,Emp_name) and this table is joined to Fact table on Emp key. Here "Emp_key" is the surrogate key and "Emp_id" is the natural primary key. And I filter the query on Emp_id but "Emp_key" in the fact table is defined as a "dist key" and read that for a large dimension defining sort & dist keys on the join keys results in better performance and so I want to know which one should i choose between Emp_key and Emp_id for Sort key in a dimension table?

And also, another confusion is choosing sort for the "date" dimension table between "date_key" or ignore defining sort key.

I would appreciate your suggestions in this regard.

Thank you!

AlexYes AlexYes · Accepted Answer · 2019-04-05T17:40:13

Your employee table likely doesn't contain too many rows, you can choose ALL distribution style, so the copy of the table sits on every node of your cluster. This way you'll avoid this dilemma at a very low cost.

UPD: with this design, I would have emp_key as dist key (so that data that is joined sits on the same nodes) and emp_id as sort key (to filter efficiently). I'm pretty sure the query planner would prioritize filtering over joining, so first it will filter the rows from the dimension table and only then it will join the corresponding rows from the fact table. But it's better to try all options and benchmark a few queries to see what works best.

If you can change the design I would just add emp_id to the fact table (because it seems like the keys map 1 to 1) as a part of ELT and avoid the dilemma again.

Sort & Dist keys selection for a dimension table in Redshift database

1 Answers