2
votes

I have another question about Redshift sort keys. We're planning to set the sort key on the columns most frequently used in WHERE clauses.

So far, the best combination for our system seems to be: DISTSTYLE EVEN + COMPOUND SORTKEY + compressed columns (except for the first sort key column).

I'm wondering which is better for our BI tables, a simple SORTKEY or a COMPOUND SORTKEY, given that the queries vary widely depending on each user's analysis.

For example, we set the compound sort key based on how often each column appears in the WHERE clauses of our queries, as follows.

COMPOUND SORTKEY
(
PURCHASE_DATE, -- set as the first sort key since it's a date column
STORE_ID,
CUSTOMER_ID,
PRODUCT_ID
)

But sometimes the actual queries filter only on PRODUCT_ID, without any of the other listed sort key columns, or filter on the columns in a different order from the compound key.

In that case, is the COMPOUND SORTKEY useless, and would a simple SORTKEY be more effective?

I'd be grateful to hear your ideas and experiences.


2 Answers

4
votes

The simple rules for Amazon Redshift are:

  • Use DISTKEY on the column that is most frequently used with JOIN
  • Use SORTKEY on the column(s) that are most frequently used with WHERE
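
As a rough sketch of these two rules, a table definition might look like the following. The table name `sales`, the `amount` column, the data types, the encodings, and the choice of `customer_id` as the join column are all assumptions made for illustration, not details taken from the question.

```sql
-- Hypothetical table: names, types, encodings and the DISTKEY choice are assumptions.
CREATE TABLE sales (
    purchase_date DATE          ENCODE RAW,   -- leading sort key column left uncompressed
    store_id      INTEGER       ENCODE AZ64,
    customer_id   BIGINT        ENCODE AZ64,
    product_id    INTEGER       ENCODE AZ64,
    amount        DECIMAL(12,2) ENCODE AZ64
)
DISTKEY (customer_id)                         -- column assumed to be used most often in JOINs
COMPOUND SORTKEY (purchase_date, store_id);   -- columns assumed to be used most often in WHERE
```

If no single join column dominates, the `DISTKEY (customer_id)` line could be swapped for `DISTSTYLE EVEN`, as in the question.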

You are correct that the above compound sort key would only be used if PURCHASE_DATE is included in the WHERE clause.

An alternative is to use Interleaved Sort Keys, which give equal weighting to many columns and can be used where different fields are often used in the WHERE. However, Interleaved Sort Keys are much slower to VACUUM and are rarely worth using.
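
For comparison, an interleaved sort key is declared in much the same way. This is only a sketch: the table name `sales_interleaved` and the column types are assumptions. The date column is left out of the key here because, as far as I know, the AWS documentation advises against interleaving on steadily increasing columns such as dates.

```sql
-- Hypothetical table; names and types are assumptions.
CREATE TABLE sales_interleaved (
    purchase_date DATE,
    store_id      INTEGER,
    customer_id   BIGINT,
    product_id    INTEGER
)
DISTSTYLE EVEN
INTERLEAVED SORTKEY (store_id, customer_id, product_id);

-- Interleaved tables need a reindexing vacuum from time to time;
-- this is the slow maintenance step.
VACUUM REINDEX sales_interleaved;
```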

So, aim to choose a SORTKEY that benefits most of your queries, but don't worry too much about the remaining queries unless you are having a particular performance problem.

See: Redshift Sort Keys - Choosing Best Sort Style | Hevo Blog

2
votes

Your compound sort key looks sensible to me. It's important to understand that a Redshift sort key is not an index that is either used or not used. The sort key is used to physically arrange the data on disk.

The query optimizer "uses" the sort key by looking at the "zone map" (min and max values) for each block during query execution. This happens for all columns regardless of whether they are in the sort key.
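
If you have superuser access, you can look at these per-block min and max values yourself in the `STV_BLOCKLIST` system table. The query below is a sketch under assumptions: it supposes a table named `sales` in which `product_id` is the fourth column (so `col = 3`, since the column index is zero-based), and it matches on the table name only, so it would mix results if two tables share that name.

```sql
-- Hypothetical table name 'sales' and column position 3 are assumptions.
SELECT b.slice,
       b.blocknum,
       b.num_values,
       b.minvalue,   -- zone map lower bound for the block
       b.maxvalue    -- zone map upper bound for the block
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON p.id = b.tbl
 AND p.slice = b.slice
WHERE TRIM(p.name) = 'sales'
  AND b.col = 3
ORDER BY b.slice, b.blocknum;
```

If the min/max ranges of most blocks overlap, a filter on that column alone can skip few blocks. If the ranges are narrow and mostly disjoint, many blocks can be skipped even though the column is not first in the sort key.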

Secondary columns in a compound sort key can still be very effective at reducing the data that has to be scanned from disk, especially when the column values are low cardinality.

See this previous example for a query to check on sort key effectiveness: Is my sort key being used?

Please review our guide for designing tables effectively: "Amazon Redshift Engineering’s Advanced Table Design Playbook". The guide discusses the correct use of Interleaved sort keys but note that they should only be used in very specific circumstances.