0
votes

This documentation describes key-distribution in redshift as follows:

The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.

I was wondering if key-distribution additionally helps in optimizing equality filters. My intuition says it should but it isn't mentioned anywhere.

Also, I saw a documentation regarding sort-keys which says that to select a sort-key:

Look for columns that are used in range filters and equality filters.

This got me confused since sort-keys are explicitly mentioned as a way to optimize equality filters.

I am asking this because I already have a candidate sort-key on which I will be doing range queries. But I also want to have quick equality filters on another column which is a good distribution key in my case.

2

2 Answers

1
votes

It is a very bad idea to be filtering on a distribution key, especially if your table / cluster is large.

The reason is that the filter may be running on just one slice, in effect running without the benefit of MPP.

For example, if you have a dist key of "added_date", you may find that all of the added date for the previous week are all together on one slice.

You will then have the majority of queries filtering for recent ranges of added_date, and these queries will be concentrated and will saturate that one slice.

1
votes

The simple rule is:

  • Use DISTKEY for the column most commonly joined
  • Use SORTKEY for fields most commonly used in a WHERE statement

There actually are benefits to using the same field for SORTKEY and DISTKEY. From Choose the Best Sort Key:

If you frequently join a table, specify the join column as both the sort key and the distribution key.

This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.

Feel free to do some performance tests -- create a few different versions of the table, and use INSERT or SELECT INTO to populate them. Then, try common queries to see how they perform.