I have a slowly changing dimension table holding all of our article master data changes, and it is pretty voluminous: 15 billion rows and growing.
The table is currently distributed on natural key combinations such as (country, supplier).
Because of the nature of the table, most queries against it are range joins, such as trivially counting orders over a changing article attribute:
SELECT x.article_id, x.changing_article_season, COUNT(*) AS counting_orders
FROM article_slow_changing_dimension x
LEFT JOIN orders y ON x.article_id = y.article_id
  AND y.order_timestamp BETWEEN x.from_timestamp AND x.to_timestamp
GROUP BY x.article_id, x.changing_article_season
What would be an interesting strategy for choosing the sort key here? I was thinking of SORTKEY(from_timestamp, to_timestamp), but I am not sure.
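To make the question concrete, here is a minimal sketch of the DDL I have in mind (the column types, the extra attribute columns and the single-column DISTKEY are made up for illustration; in reality we distribute on something like (country, supplier)):

-- Sketch only: types and DISTKEY column are illustrative, not our real DDL
CREATE TABLE article_slow_changing_dimension (
    article_id              BIGINT      NOT NULL,
    changing_article_season VARCHAR(16),
    country                 VARCHAR(2),
    supplier                VARCHAR(64),
    from_timestamp          TIMESTAMP   NOT NULL,
    to_timestamp            TIMESTAMP   NOT NULL
    -- ... other article attributes
)
DISTSTYLE KEY
DISTKEY (country)   -- stand-in: Redshift allows only one DISTKEY column
COMPOUND SORTKEY (from_timestamp, to_timestamp);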
I tried a few things, but any test takes a long time to set up and the results are hard to assess empirically. Any ideas?
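For what it is worth, this is roughly how I have been trying to judge the tests (my assumption is that is_rrscan in STL_SCAN tells me whether the sort key actually restricted the scan; the query id is a placeholder to be taken from STL_QUERY):

-- Sketch: check whether the scan over the dimension table was range-restricted
SELECT slice, segment, is_rrscan, rows_pre_filter, rows
FROM stl_scan
WHERE query = <query_id>   -- placeholder for the id of the join query above
  AND perm_table_name = 'article_slow_changing_dimension'
ORDER BY slice, segment;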
EDIT: adding a few details based on the comments:
1/ The tables are vacuumed.
2/ The cluster is pretty small (4 nodes) and the query runs pretty fast, but it is not in production yet, so it is basically only me and the devs running a few queries. I would like to optimize before going to production.
3/ There are roughly 15 billion rows right now, and aggregating for a specific timestamp (roughly the query below) takes about 1 min; I would like to push that down to 20 sec.
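For point 3/, the "aggregating for a specific timestamp" query looks roughly like this (a sketch; the timestamp literal is just an example):

SELECT x.changing_article_season, COUNT(*) AS articles
FROM article_slow_changing_dimension x
WHERE TIMESTAMP '2019-01-01 00:00:00' BETWEEN x.from_timestamp AND x.to_timestamp
GROUP BY x.changing_article_season;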