Using SYSTEM$CLUSTERING_INFORMATION to identify potential clustering keys

Question

Queries run on my event based database currently scan all rows even if a certain event is filtered - leading to long scan times. Event_type is something that I would use often in filters which is why I think it might be a good thing to cluster on. The table is already clustered by date_id and app_title. I used the SYSTEM$CLUSTERING_INFORMATION to see if clustering on the additional event_type column would be useful.The results were bad. Does this mean that this would be a bad choice? Or does it just mean that the current table is poorly clustered on this key? Would creating a table with these three cluster keys lead to different results?

(I changed some names and values in the query/results below)

select system$clustering_information('materialized_view', '(date_id, app_title, event_type)');
 
{
  "cluster_by_keys" : "LINEAR(date_id, app_title, event_type)",
  "total_partition_count" : <more than 100k>,
  "total_constant_partition_count" : 0,
  "average_overlaps" : ~500
  "average_depth" : ~500,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0,
    "00064" : 30,
    "00128" : 3218,
    "00256" : 22146,
    "00512" : 94367,
    "01024" : 134114
  }
}

Greg Pavlik Greg Pavlik · Accepted Answer · 2021-03-09T20:20:02

This is showing the current state of the clustering, which is not good. That means creating a cluster key the way you have it defined may help.

The order of the columns (or expressions) in the cluster key is very important. You want to go from lower cardinality to higher cardinality. If, for example, you have only five event types then it should probably be the first in the list of columns.

The APP_TITLE column is more interesting without the context. If it's got high cardinality (which the name of the column seems to suggest), you can limit the cardinality using an expression such as left(APP_TITLE, 2).

Remember, if you need to set a key on a very high cardinality or unique column, reduce the cardinality using an expression. You can see which functions Snowflake supports in cluster keys this way:

show functions; 
-- Look at the "valid_for_clustering" column to see which are allowed.

Using SYSTEM$CLUSTERING_INFORMATION to identify potential clustering keys

1 Answers