How does table re-distribution work in Amazon Redshift

Question

Can anyone please help explain more in depth as how exactly table re-distribution works in AWS Redshift? I read from AWS Redshift official document which says the query optimizer will always try to re-distribute the table to co-locate the rows whenever the rows required in the queries are not co-located already. If so, that is almost true that every time unpredictable queries come in they will result in the AWS Redshift query optimizer to re-distribute the data as the chance of all the rows co-located for every single query is close to zero?

Thank you.

John Rotenstein John Rotenstein · Accepted Answer · 2020-05-24T11:51:48

The best way to optimize queries in Amazon Redshift is to use appropriate values for the Distribution Key and the Sort Key.

The Distribution Key (DISTKEY) determines which rows to put on which slice (a node can have multiple slices). In general, the DISTKEY should be based on the column that is most JOINed.

Let's say you have a Customer Table and an Invoice Table, and both tables have a customer_id column that is used to JOIN the two tables. If this field is used as the DISTKEY for both tables, then all rows in both tables for a given customer_id will be stored in the same slice. This means that the nodes do not need to redistribute data between slices or nodes. This allows Amazon Redshift to operate on the data in parallel.

Of course, there might be many JOINs used, so this can't always be the case. As you say, "unpredictable queries" will not always be optimized. Recommendations on choosing the best DISTKEY can be found on: Choose the best distribution style - Amazon Redshift

There are also some situations where it is worth using an ALL distribution type, which means the table is replicated on all nodes. This avoids having to send the data between nodes and is recommended for small tables that are often joined in queries.

It should also be mentioned that the SORTKEY is equally important. The simple rule is to set the SORTKEY to be the column that is most often used in a WHERE statement. This allows Redshift to easily "skip over" disk blocks because it stores a Zone Map for each block, indicating the minimum and maximum values stored within the block. (A block contains data from only one column.) If this column is the SORTKEY, then it will be very easy to identify which blocks can be skipped. Since reading from disk is a very time-consuming action, this will also speed-up the query because less data is read from disk.

How does table re-distribution work in Amazon Redshift

1 Answers