Redshift: Does key-based distribution optimize equality filters?

Question

This documentation describes key distribution in Redshift as follows:

The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.

I was wondering whether key distribution also helps optimize equality filters. My intuition says it should, but this isn't mentioned anywhere.

I also saw documentation on sort keys, which gives this advice for choosing a sort key:

Look for columns that are used in range filters and equality filters.

This confused me, since sort keys are explicitly recommended as a way to optimize equality filters.

I am asking because I already have a candidate sort key on which I will be running range queries. But I also want fast equality filters on another column, which is a good distribution key in my case.

Jon Scott · Accepted Answer · 2017-11-14T16:01:17

Filtering on a distribution key is a very bad idea, especially if your table or cluster is large.

The reason is that the filter may run on just one slice, in effect running without the benefit of MPP (massively parallel processing).
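To make the one-slice effect concrete, here is a minimal sketch in Python rather than anything Redshift actually runs. Redshift's real hash function is internal, so `zlib.crc32` is only a stand-in, and `NUM_SLICES`, `slice_for`, and the `customer_N` keys are hypothetical names chosen for illustration.

```python
import zlib
from collections import Counter

NUM_SLICES = 8  # hypothetical slice count, not taken from any real cluster


def slice_for(key: str, num_slices: int = NUM_SLICES) -> int:
    """Map a distribution-key value to a slice.

    Assumption: crc32 stands in for Redshift's internal hash, which is not public.
    The only property that matters here is that equal keys map to the same slice.
    """
    return zlib.crc32(key.encode("utf-8")) % num_slices


# 10,000 rows spread over 100 distinct distribution-key values.
rows = [f"customer_{i % 100}" for i in range(10_000)]

# How many rows each slice stores under key distribution.
rows_per_slice = Counter(slice_for(key) for key in rows)

# Slices that hold rows matching an equality filter on the distribution key.
equality_filter_slices = {slice_for(key) for key in rows if key == "customer_42"}

print("slices holding data:", len(rows_per_slice))
print("slices holding rows for the filter:", len(equality_filter_slices))
```

Because every row with the same key value hashes to the same slice, all rows that match the equality filter live on exactly one slice, so the other slices have nothing to contribute to that query.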

For example, if you have a dist key of "added_date", you may find that all of the rows added during the previous week sit together on one slice.

The majority of your queries will then be filtering on recent ranges of added_date, so they will be concentrated on that one slice and will saturate it.
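The same idea can be sketched for the added_date case. Again this is only an illustration under assumptions: `zlib.crc32` stands in for Redshift's internal hash, and `NUM_SLICES`, `slice_for`, `slices_for_range`, and the start date are hypothetical. With a hash like this, a week of dates has only seven distinct values, so it can occupy at most seven slices no matter how large the cluster is, and possibly fewer if some dates collide.

```python
import zlib
from datetime import date, timedelta

NUM_SLICES = 32  # hypothetical slice count for a larger cluster


def slice_for(key: str, num_slices: int = NUM_SLICES) -> int:
    """Assumption: crc32 stands in for Redshift's internal, non-public hash."""
    return zlib.crc32(key.encode("utf-8")) % num_slices


def slices_for_range(start: date, days: int, num_slices: int = NUM_SLICES) -> set:
    """Return the set of slices that hold rows for `days` consecutive dates."""
    return {
        slice_for((start + timedelta(days=offset)).isoformat(), num_slices)
        for offset in range(days)
    }


# Slices touched by a query filtering on one week of added_date values.
last_week = slices_for_range(date(2017, 11, 7), 7)

print(f"{len(last_week)} of {NUM_SLICES} slices hold last week's rows")
```

However the dates happen to hash, most of the slices stay idle for a "recent week" query, which is the concentration described above.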

