2
votes

I have a slowly changing dimension representing all of our article master-data changes, and it is pretty voluminous: 15 billion rows and growing.

The table is currently distributed on natural key combinations such as (country, supplier).

Because of the nature of the table, most of the queries that use it are range joins, such as trivially counting orders over a changing article attribute:

SELECT x.article_id, x.changing_article_season, COUNT(*) AS counting_orders
FROM article_slow_changing_dimension x
LEFT JOIN orders y ON x.article_id = y.article_id
  AND y.order_timestamp BETWEEN x.from_timestamp AND y.to_timestamp
GROUP BY x.article_id, x.changing_article_season

What would be a sensible strategy for choosing the sort key here? I was thinking of SORTKEY(from_timestamp, to_timestamp), but I am not sure.

I tried a few things, but any test takes a long time to set up and is hard to assess empirically. Any ideas?

EDIT: adding a few details based on the comments:
1/ the tables are vacuumed
2/ the cluster is pretty small (4 nodes) and the query runs pretty fast, but it is not in production, so it is basically only me and the devs running a few queries. I would like to optimize before going to production
3/ there are roughly 15 billion rows right now, and aggregating for a specific timestamp takes about 1 min; I would like to push that down to 20 sec

Does that query really execute? – jarlh
Specifically - check the syntax of "BETWEEN x.from_timestamp=y.to_timestamp" - is this correct? How many rows in orders? What is the size of your Redshift cluster? How long does it take to run, approximately? Are your tables vacuumed and analyzed? Please edit your question with these things. – Jon Scott
My bad; I fixed the example query; the conceptual issue remains :) – Breathe
You are joining BETWEEN x.from_timestamp AND y.to_timestamp?? So the from_timestamp is on the x table and the to_timestamp is on the y table? Are you sure about this? – Jon Scott
You say "the cluster is pretty small (4 nodes) and the query runs pretty fast, but it is not in production, so it is basically only me and the devs running a few queries. I would like to optimize before going to production" - unfortunately this will be very hard to do on test data - run it on prod, then come back with info on performance and table sizes. – Jon Scott

1 Answer

3
votes

Great question.

A little background: sort keys have two primary purposes: 1) minimize the data scanned from disk and 2) enable joins between large tables to use a merge join (the fastest join type). https://docs.aws.amazon.com/redshift/latest/dg/query-performance-improvement-opportunities.html

SORTKEY(from_timestamp, to_timestamp) is usually a very good choice, but it won't improve the performance of your example query. It's more helpful when you use these fields in a predicate, like WHERE from_timestamp > '2019-01-01' AND to_timestamp < current_date.
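
To make that concrete, here is a rough sketch of the DDL and the kind of query shape that sort key actually helps with. The column names come from your example, but the types are my assumptions:

CREATE TABLE article_slow_changing_dimension (
    article_id              BIGINT,
    changing_article_season VARCHAR(16),
    from_timestamp          TIMESTAMP,
    to_timestamp            TIMESTAMP
)
COMPOUND SORTKEY (from_timestamp, to_timestamp);

-- A range predicate on the leading sort key column lets Redshift prune blocks via zone maps
SELECT COUNT(*)
FROM article_slow_changing_dimension
WHERE from_timestamp > '2019-01-01'
  AND to_timestamp < current_date;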

There is a limit to how much you can optimize this kind of range join because the database has to treat it like a cartesian product (aka "CROSS JOIN" - join every row from a with every row from b). You know that the join will match a single row but the database doesn't know.

In a fully dimensional DW I would add an article_sk surrogate key. That value would resolve to exactly one row in the SCD. This complicates the ETL process, though, because you have to inject the surrogate key during processing.
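
A sketch of what your query could look like with that approach, assuming a hypothetical article_sk column on both tables that your ETL stamps onto each order at load time (neither column exists in your current schema):

SELECT x.article_id, x.changing_article_season, COUNT(*) AS counting_orders
FROM article_slow_changing_dimension x
JOIN orders y
  ON x.article_sk = y.article_sk  -- plain equi-join; the range lookup was already resolved during ETL
GROUP BY x.article_id, x.changing_article_season;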

Another thing you could do is distribute both tables using the article column. That allows the join to be completed on each slice in parallel. However, article probably won't be the natural distribution key for your orders fact table (usually that would be customer or account).
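
If you want to try that, something along these lines would co-locate both tables on article_id (ALTER ... ALTER DISTKEY is only available on newer Redshift releases; otherwise recreate the tables with DISTKEY(article_id)):

ALTER TABLE article_slow_changing_dimension ALTER DISTKEY article_id;
ALTER TABLE orders ALTER DISTKEY article_id;

Just keep in mind that moving orders onto article_id will un-colocate any joins that currently rely on its existing distribution key.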