I have a process that streams data to a few different tables(ingestion-time partitioned). I try to replace them by creating clustered equivalents.
Encouraged by an excellent article, I've started to improve queries performance. I've created new tables with corresponding schema, proper clustering fields and set up streaming.
Just to mention, I did some tests before with tables into which data were loaded and queries got the boost. After two days of streaming, I've noticed that there is no gain using the new setup. As I understand from the topic and the other one and the issue, clustering with streaming don't give extra gain without additional effort. Am I right or not? I thought about systematic re-clustering of previous day partition, but still no gain for querying the most recent data.
What would be the best way to make those two features work together to improve queries performance? Is there a way to re-cluster data that do not have a real key to use in DML MERGE statement?
MERGE my_table USING (SELECT * FROM my_table) ON <expression_here>
"no real key" – I meant that I have no set of columns that may be used as key. Or the idea was to match on equality of all columns. But, such JOIN will take long. – Damian BodnarSELECT *
sorts data in background. All the time I was convinced that I need to callMERGE SELECT *
, not justSELECT *
. Thanks again for clarification! It may be my fault, but probably the docs could be more precise at the specific section. – Damian Bodnar