3
votes

For streaming inserts, I want to use a template table (with user id suffix) which is itself a Partitioned table. This way I can make my tables smaller than just using Partitioned Tables and hence make my queries more cost-effective. Also my query cost per user stays constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:-

To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.

It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?

One more concern regarding my above proposed architecture is the 4000 limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables . Does it mean that my partitioned table can't cover more than 4000 days? Will I have to delete old partitions in this case or will the last partition keep storing any subsequent streamed data?

1
See the following question to see how to get around this limit by partitioning by week/month/year: stackoverflow.com/a/56125049/132438 - Felipe Hoffa

1 Answers

3
votes

You should look into Clustered Tables on partitioned tables.

With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.

Introduction to Clustered Tables

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.

Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.

Clustered table pricing

When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.