Cluster key selection on Snowflake

Question

(Submitting on behalf of a Snowflake User)

QUESTION: Why would the filter or the search key(key used in where clause) would be a better choice for cluster key than an order by or group by key.

One resource recommends reading: https://support.snowflake.net/s/article/case-study-how-clustering-can-improve-your-query-performance

Another resource mentions:

The performance of query filter will be better because the data is sorted it would skip all the rows which are not required.

For the scenario which has query filter on columns which are not part of sort order but the columns in group by and order by are part of data sort order (clustered keys), it may take time to select those data but the sorting would be easy since the data is already in an order.

A 3rd resource states:

The clustering key is important for the WHERE clause when you only select a small portion of the overall data that you have in your tables, because it can reduce the amount of data that has to be read from the Storage into the Compute when the Optimizer can use the clustering key for Query Pruning.

You can alternatively use the clustering key to optimize table inserts and possibly also query output (eg sort order).

Your choice should depend on your priorities, there is no cure all unless a single key covers all above.

To which the User responds with the following questions:

If I always insert the rows in the order in which they will be retrieved, do I still need to create a cluster key? For example if a table is always queried using a date_timestamp and if I ensure that I am inserting in the table order by date_timestamp, do I still need to create a cluster key on date_timestamp?

Any thoughts, recommendations, etc.? Thanks!

Nat Taylor Nat Taylor · Accepted Answer · 2019-10-17T16:53:10

For choosing a cluster key based on FILTER/GROUP/SORT. The first "resource" is right. If the filter will result in pruning, then it is probably best (so that data can be skipped.) If all/most of the data must be read, then clustering on a GROUP/SORT key is probably fast (so less time is spent re-sorting) These docs state:

Typically, queries benefit from clustering when the queries filter or sort on the clustering key for the table. Sorting is commonly done for ORDER BY operations, for GROUP BY operations, and for some joins.

For the second question on natural clustering, there would be little to no performance benefit for defining a cluster key in that case.

Cluster key selection on Snowflake

1 Answers