
I came across a situation where I am defining the distkey as the column that is used to join the table with other tables (to avoid re-distribution). But that column is not the highest-cardinality column, so it skews the data distribution.

Example:

Transaction Table (20M rows)

------------------------------
|   user_id           | int  |
|   transaction_id    | int  |
|   transaction_date  | date |
------------------------------

Let's say most of the joins performed on this table are on user_id, but transaction_id is the higher-cardinality column, since one user can have multiple transactions.

What should be done in this situation?

  • Distribute the table on the transaction_id column, even though the data will then need to be redistributed whenever it is joined to another table on user_id?

  • Distribute on user_id and accept the skew? In my case the skew factor is ~15, which is way higher than the AWS Redshift recommended skew factor of 4.0 (the query below shows how I'm measuring it)
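For reference, this is roughly how I'm measuring the skew (assuming the table is named transactions; svv_table_info is the standard Redshift system view):

    -- skew_rows = rows on the slice with the most rows divided by rows on
    -- the slice with the fewest; ~1 is even, mine reports ~15.
    SELECT "table", diststyle, skew_rows
    FROM svv_table_info
    WHERE "table" = 'transactions';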

The general rule is to set the DISTKEY to the column most commonly JOINed. If that is user_id, then it is a good DISTKEY. Using another column simply because it has high cardinality is not suitable if it is not used in the JOIN. - John Rotenstein

1 Answer


As John rightly says, you LIKELY want to lean towards improving join performance over reducing data skew, but this rests on a number of likely-true assumptions. I'll itemize a few here:

  1. The distribution (disk-based) skew is on a major fact table
  2. The other tables are also distributed on the join-on key (see the DDL sketch after this list)
  3. The joins are usually on the raw tables, or any group-bys are performed on the dist key
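For illustration, here is a minimal DDL sketch of assumption 2; the second table and any columns beyond the question's are hypothetical:

    -- Both tables share DISTKEY (user_id), so rows with the same user_id
    -- land on the same slice and the join can run without redistribution.
    CREATE TABLE transactions (
        user_id          INT,
        transaction_id   INT,
        transaction_date DATE
    )
    DISTSTYLE KEY
    DISTKEY (user_id)
    SORTKEY (transaction_date);

    CREATE TABLE user_profiles (
        user_id     INT,
        signup_date DATE
    )
    DISTSTYLE KEY
    DISTKEY (user_id);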

Redshift is a networked cluster and the interconnects between nodes are the lowest-bandwidth aspect of the architecture (not low bandwidth, just lower than the other aspects). Moving very large amounts of data between nodes is an anti-pattern for Redshift and should be avoided whenever possible.
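One way to check whether a particular join will redistribute data is the join step label in the query plan (a rough sketch; user_profiles is the hypothetical table from above):

    EXPLAIN
    SELECT t.user_id, COUNT(*)
    FROM transactions t
    JOIN user_profiles u ON u.user_id = t.user_id
    GROUP BY t.user_id;
    -- DS_DIST_NONE in the join step means no rows move between nodes;
    -- DS_DIST_INNER, DS_BCAST_INNER or DS_DIST_BOTH mean rows are being
    -- redistributed or broadcast across the interconnect.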

Disk skew is a measure of where the data is stored around the cluster and, without query-based information, only impacts how efficiently the data is stored. The bigger impact of disk skew is execution skew - the difference in the amount of work each CPU (slice) does when executing a query. Since the first step of every query is for each slice to work on the data it "owns", disk skew leads to some amount of execution skew. How much depends on many factors, especially the query in question. Disk skew can lead to issues, and in some cases this CAN outweigh redistribution costs. But since per-slice performance of Redshift is high, execution skew OFTEN isn't the #1 factor driving performance.
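If you want to see execution skew directly, per-slice row counts for a finished query are available in svl_query_report (the query id below is a placeholder):

    -- A wide gap between the busiest and least-busy slice in a step
    -- indicates execution skew for that query.
    SELECT segment, step,
           MAX(rows) AS max_rows_on_a_slice,
           MIN(rows) AS min_rows_on_a_slice
    FROM svl_query_report
    WHERE query = 123456        -- replace with your query id
    GROUP BY segment, step
    ORDER BY segment, step;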

Now (nearly) all queries have to perform some amount of data redistribution when executing. If you group two tables by some non-dist-key column and then join them, redistribution will be needed to perform the join. The good news is that (hopefully) the amount of data post-group-by will be small, so the cost of redistribution will be low. The amount of data being redistributed is what matters.
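A sketch of that pattern (the second table and its columns are hypothetical): each aggregation runs per slice first, so only the much smaller grouped results cross the network for the join.

    WITH txn_by_day AS (
        SELECT transaction_date, COUNT(*) AS txn_count
        FROM transactions
        GROUP BY transaction_date
    ),
    logins_by_day AS (
        SELECT login_date, COUNT(*) AS login_count
        FROM logins                          -- hypothetical second table
        GROUP BY login_date
    )
    -- Only the per-day aggregates (a few thousand rows, not 20M) need to
    -- be redistributed to perform this join.
    SELECT t.transaction_date, t.txn_count, l.login_count
    FROM txn_by_day t
    JOIN logins_by_day l ON l.login_date = t.transaction_date;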

The dist-key of a table is only one way to control how much data gets redistributed. Some ways to do this:

  1. If the dimension tables are dist-style ALL then it doesn't (in basic cases) matter that your fact table is distributed by user_id - the data to be joined already exists on the nodes it needs to be on (see the sketch after this list).
  2. You can also control how much data is redistributed by reducing how much data goes into the join. Having where clauses at the earliest stage in the query can do this. Denormalizing your data so that needed where-clause columns appear in your fact tables can be a huge win.
  3. In extreme cases you can make derived dist-key columns that align perfectly to user_id but also have greatly reduced disk and execution skew. This is a deeper topic than can be covered in this answer, but it can be the answer when you need max performance and redistribution and skew are in conflict.
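To illustrate points 1 and 2 (the dimension table, the merchant_id column, and the date cutoff are hypothetical):

    -- Point 1: a small dimension with DISTSTYLE ALL is copied to every
    -- node, so joining to it never needs redistribution, whatever the
    -- fact table's dist key is.
    CREATE TABLE dim_merchant (
        merchant_id   INT,
        merchant_name VARCHAR(64)
    )
    DISTSTYLE ALL;

    -- Point 2: filter as early as possible so far fewer rows reach the join.
    SELECT t.user_id, d.merchant_name, COUNT(*) AS txn_count
    FROM transactions t
    JOIN dim_merchant d ON d.merchant_id = t.merchant_id
    WHERE t.transaction_date >= '2023-01-01'
    GROUP BY t.user_id, d.merchant_name;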

A quick word on "cardinality". This is a rule-of-thumb metric that a lot of Redshift documents use as a way to keep new users out of trouble, and one that can be explained quickly. It's a (somewhat useful) over-simplification. Higher cardinality is not always better and in the extreme is an anti-pattern - think of a table where every row has a unique dist-key value, then think about doing a group-by on some other column of that table. The data skew in this example is perfect, but the performance of the group-by will suck. You want to distribute the data to speed up the work that needs to be done - not to improve a metric.
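A concrete version of that extreme case: if the table were distributed on transaction_id (unique per row), disk skew would be perfect, but a query like the one below gets no benefit from the distribution, since no slice can finish the grouping locally.

    -- With DISTKEY (transaction_id), every slice holds rows for every user,
    -- so the per-slice partial aggregates still have to be redistributed
    -- and merged before the final counts exist anywhere.
    SELECT user_id, COUNT(*) AS txn_count
    FROM transactions
    GROUP BY user_id;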