As John rightly says, you LIKELY want to lean towards improving join performance over reducing data skew, but this rests on a number of assumptions that are probably (though not certainly) true. I'll itemize a few here:
- The distribution (disk-based) skew is on a major fact table
- The other tables are also distributed on the join-on key
- The joins are usually on the raw tables or group-bys are performed on the dist key
Redshift is a networked cluster, and the interconnects between nodes are the lowest-bandwidth part of the architecture (not low bandwidth, just lower than the other parts). Moving very large amounts of data between nodes is an anti-pattern for Redshift and should be avoided whenever possible.
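You can see whether a query moves data over the network from its plan. A minimal sketch, using hypothetical table names - the `DS_*` labels in Redshift's EXPLAIN output are what to look for:

```sql
EXPLAIN
SELECT f.user_id, d.segment
FROM   fact_events f              -- hypothetical fact table
JOIN   dim_users   d              -- hypothetical dimension table
       ON d.user_id = f.user_id;
-- In the plan: DS_DIST_NONE means no rows move for the join;
-- DS_BCAST_INNER or DS_DIST_BOTH mean broadcast / full redistribution,
-- which is the expensive network traffic discussed above.
```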
Disk skew is a measure of how the data is stored around the cluster and, absent any query information, only tells you how evenly the data is stored. The bigger impact of disk skew is execution skew - the difference in the amount of work each CPU (slice) does when executing a query. Since the first step of every query is for each slice to work on the data it "owns", disk skew leads to some amount of execution skew. How much depends on many factors, especially the query in question. Disk skew can lead to issues, and in some cases this CAN outweigh redistribution costs. But since per-slice performance in Redshift is high, execution skew OFTEN isn't the #1 factor driving performance.
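Disk skew is easy to measure from the system catalog. `SVV_TABLE_INFO` exposes `skew_rows`, the ratio of rows on the fullest slice to rows on the emptiest slice (1.0 is perfectly even):

```sql
SELECT "table", diststyle, tbl_rows, skew_rows
FROM   svv_table_info
ORDER  BY skew_rows DESC;
-- skew_rows near 1.0 = even distribution; large values flag tables
-- whose "owning" slices will do unequal work in step one of a query.
```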
Now, (nearly) all queries have to perform some amount of data redistribution when executing. If you group two tables by some non-dist-key column and then join them, redistribution will be needed to perform the join. The good news is that (hopefully) the amount of data post-group-by will be small, so the cost of redistribution will be low. The amount of data being redistributed is what matters.
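A sketch of the pattern, again with hypothetical tables - aggregate first so only the small grouped result crosses the network, not the raw fact rows:

```sql
WITH daily AS (
    SELECT event_date, COUNT(*) AS events
    FROM   fact_events            -- distributed on user_id, NOT event_date
    GROUP  BY event_date          -- collapses billions of rows to ~1/day
)
SELECT d.event_date, d.events, c.is_holiday
FROM   daily d
JOIN   dim_calendar c             -- hypothetical calendar dimension
       ON c.cal_date = d.event_date;
-- Only the post-group-by rows are redistributed for the join,
-- so the mismatch with the dist-key is cheap.
```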
The dist-key of the tables is only one way to control how much data is redistributed. Some other ways to do this:
- If the dimension tables are dist-style ALL then it doesn't (in basic
cases) matter that your fact table is distributed by user_id - the
data to be joined already exists on the nodes it needs to be on.
- You can also control how much data is redistributed by reducing how
much data goes into the join. Having where clauses at the earliest
stage in the query can do this. Denormalizing your data so that
needed where clause columns appear in your fact tables can be a huge
win.
- In extreme cases you can make derived dist-key columns that align
  perfectly to user_id but also have greatly reduced disk and
  execution skew. This is a deeper topic than fits in this answer,
  but it can be the answer when you need maximum performance and
  redistribution and skew are in conflict.
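The first bullet can be sketched in DDL. These are hypothetical tables; the point is that a `DISTSTYLE ALL` dimension is copied to every node, so joins against it need no redistribution regardless of the fact table's dist-key:

```sql
CREATE TABLE dim_users (          -- small dimension, full copy per node
    user_id  BIGINT,
    segment  VARCHAR(32)
) DISTSTYLE ALL;

CREATE TABLE fact_events (        -- large fact table, sliced by user_id
    event_id BIGINT,
    user_id  BIGINT,
    event_ts TIMESTAMP
) DISTKEY (user_id);
-- A join of fact_events to dim_users finds every dimension row locally
-- (DS_DIST_ALL_NONE in the plan), whatever column the join uses.
```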
A quick word on "cardinality". This is a rule-of-thumb metric that a lot of Redshift documents use as a way to keep new users out of trouble, and it can be explained quickly. It's a (somewhat useful) over-simplification. Higher cardinality is not always better and in the extreme is an anti-pattern - think of a table where every row has a unique dist-key value, then think about doing a group-by on some other column of that table. The data skew in this example is perfect, but performance of the group-by will suck. You want to distribute the data to speed up the work that needs to be done - not to improve a metric.
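The extreme case above in DDL form (hypothetical table): distribution is perfectly even, yet it does nothing for the query you actually run:

```sql
CREATE TABLE clicks (
    event_id   BIGINT,            -- unique per row: skew_rows ~ 1.0
    event_type VARCHAR(16)
) DISTKEY (event_id);

SELECT event_type, COUNT(*)
FROM   clicks
GROUP  BY event_type;
-- The uniform spread by event_id doesn't help here; the aggregation
-- keys on event_type, so the grouped data still has to be brought
-- together across slices. The metric looked perfect; the work didn't.
```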
"…user_id, then it is a good DISTKEY. Using another column simply because it has high cardinality is not suitable if it is not used in the JOIN." - John Rotenstein