Redshift : What can be appropriate combination of DIST KEY and SORT KEY for Master Tables?

Question

Let me ask about what can be approptiate distkey(dist style) and sort key for master tables in Redshift.

We have several master tables with differnet size, purpose and cardinalities and now we're verify what SORTKEY and DISTKEY combination would be more appropriate.

For example, we have two main key in our customer master table, frequent use in our BI queries and join key for other tables, as follows;

Ex. MST_CUSTOMER (about 7 millions, 850MB in SQL Server)

CUSTOMER_ID     <-- Primary Key, Unique and High Cardinality
CUSTOMER_CATEGORY   <-- Low Cardinality (like VIP, BZ) but not always uses in BI query
CUST_NAME
CUST_ADDRESS...

First, if we can use both columns for SORTKEY, which is the best order in this case?

1.Should Low cardinality comes first (CUST_CATEGORY, CUST_ID) 
2.Hifh frequency comes first (CUST_ID, CUST_CATEGORY)
3.should chose only single column eigher of CUST_ID or CUST_CATEGORY

And generally, may I ask what is the best combination in such case?

1.DISTSTYLE KEY & COMPOUND SORTKEY
2.DISTSTYLE KEY & SORTKEY
3.DISTSTYLE ALL & INTERLEAVED SORTKEY...

I imagine if we would find 'DS_BCAST_INNER' in query plan frequently, we'd better to consider 'DISTSTYLE ALL'. But still not sure what can be best practice for master tables.

Any suggestion would be appreciated again.

Best Regards

what is your most common join between 2 large tables? what is your most common where clause criteria (what do you always use - eg date range) — Jon Scott
If you're going to be joining tables together then you should try and match the distribution key of the larger table, or consider using DIST ALL for the smaller table. You can't really make distribution decisions in isolation for each table, you need to think about what sort of queries are going to run on them and distribute the data across all the nodes such that each node ends up doing approximately the same amount of work. — Nathan Griffiths
@Jon Scott Thanks a lot for your comments, Jon. Customer_ID is more frequent seek and join to other transaction tables, but Customer_Category is still used in our BI for analysis purposes. So I was wondering cardinality should be considered or not. Maybe frequency prioritized, Customer_ID should come first. Thank you. — Sachiko
@Nathan Thanks a lot for your considerable comments again, Nathan. As you said we’d better to consider more about the actual queries and balance on the nodes. We’ll try key matching to other tables or distribution all. Thank you so much! — Sachiko

Nate Nate · Accepted Answer · 2018-10-05T15:31:57

If you join often by customer_id then make that the sort key. Customer_category would make a good secondary sort key. The data should be distributed together with data that it will be joined to. If you have fact tables distributed by customer Id then make that the distribution key. It also depends on how many new customers you get. If sales are heavily skewed towards new customers then your data distribution will be skewed and customer_id would not be a good distribution key.

Redshift : What can be appropriate combination of DIST KEY and SORT KEY for Master Tables?

1 Answers