I have a quite large table (1.5TB, 18 billion records) that holds IoT type of data. In general, it has EVENT_TYPE (VARCHAR), EVENT_TIME (TIMESTAMP), and some columns storing event data. There are 12 event types in total with varying record distribution (from 5-10 million to 1-5 billion).
Most of the queries are filtering on specific event type and date range. Also, there are views created pointing to the data for specific event type. These are usually queried by date range.
There is clustered index created on EVENT_TYPE, CAST(EVENT_TIME AS DATE). I thought it would be most suitable for this type of workload. However, the performance is not very good.
For example, query like SELECT COUNT(1) FROM <event table> WHERE EVENT_TIME >= '2020-01-01' AND EVENT_TYPE = '<some type>'
runs for 30 seconds on XL warehouse. I would expect better performance, as it 's using indexed columns. Also, it is scanning 25% of the partitions (42,786 out of 174,201), while it covers only 0.5% of total data.
I suspect that clustered index is not optimal. Here are initial results from SYSTEM$CLUSTERING_INFORMATION:
{
"cluster_by_keys" : "LINEAR(EVENT_TYPE, CAST(EVENT_TIME AS DATE))",
"total_partition_count" : 175792,
"total_constant_partition_count" : 44575,
"average_overlaps" : 97273.7777,
"average_depth" : 93801.4483,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 44536,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"131072" : 130790,
"65536" : 466
}
}
I have tried re-clustering the table, but it did not improve much:
{
"cluster_by_keys" : "LINEAR(EVENT_TYPE, CAST(EVENT_TIME AS DATE))",
"total_partition_count" : 173905,
"total_constant_partition_count" : 55880,
"average_overlaps" : 78938.3633,
"average_depth" : 74663.1889,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 55829,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"04096" : 5,
"08192" : 7,
"131072" : 117196,
"16384" : 15,
"32768" : 529,
"65536" : 324
}
}
Looking into results above I see a couple of issues: 1. average_overlaps and average_depth appears to high 2. The histogram is skewed to the end. I would expect more or less even distribution
Any ideas on what might be wrong, or how to improve it?
Thanks, Rimvis
alter table <table name> recluster
It's a bit strange that it works, because according to Snowflake documentation "As of May, 2020, manual reclustering has been deprecated for all accounts." docs.snowflake.com/en/user-guide/tables-clustering-manual.html What is also interesting, that average_overlaps and average_depth keeps decreasing when I run RECLUSTER. – Rimvydas Gurskis