Google BigQuery: streaming & clustering at the same time

Question

I have a process that streams data to a few different tables(ingestion-time partitioned). I try to replace them by creating clustered equivalents.

Encouraged by an excellent article, I've started to improve queries performance. I've created new tables with corresponding schema, proper clustering fields and set up streaming.

Just to mention, I did some tests before with tables into which data were loaded and queries got the boost. After two days of streaming, I've noticed that there is no gain using the new setup. As I understand from the topic and the other one and the issue, clustering with streaming don't give extra gain without additional effort. Am I right or not? I thought about systematic re-clustering of previous day partition, but still no gain for querying the most recent data.

What would be the best way to make those two features work together to improve queries performance? Is there a way to re-cluster data that do not have a real key to use in DML MERGE statement?

About clustering with streaming, yes, streaming doesn't work as well with clustered tables, as ordering is weakened. What do you mean with "no real key"? The MERGE statement is the one that would help you; its use is well explained in the link you provided. — asbovelw
I might have misunderstood the following part: You can re-cluster the data in the entire table by running a SELECT * query that selects from and overwrites the table (or any specific partition in it)., cloud.google.com/bigquery/docs/clustered-tables. As I understand, the idea is to run something like: MERGE my_table USING (SELECT * FROM my_table) ON <expression_here> "no real key" – I meant that I have no set of columns that may be used as key. Or the idea was to match on equality of all columns. But, such JOIN will take long. — Damian Bodnar
The idea of the part you mentioned is that you have already clustered some of your data, so every time you run the "SELECT *" you order also the data which added after the last clustering. So, after that you will be able to search more efficient into your bigquery data. Isn't it that you are interested of doing? — TasosV
Thanks for the response! It was a little bit unclear to me that SELECT * sorts data in background. All the time I was convinced that I need to call MERGE SELECT *, not just SELECT *. Thanks again for clarification! It may be my fault, but probably the docs could be more precise at the specific section. — Damian Bodnar

TasosV TasosV · Accepted Answer · 2019-08-21T10:18:51

The idea of the clustering partitioned tables is that every time you have already clustered some of your data

You have only to run the

SELECT *

and the data which added after the last clustering will be in order.

After this you will be able to search more efficient into your bigquery data.

Google BigQuery: streaming & clustering at the same time

1 Answers