We're in the midst of migrating our on-prem Oracle db to the cloud. The biggest project is to move our fact table that tracks customer transactions.
Short Question: What is the best way to shard/partition a fact table in BigQuery when you can't use a date field for partitioning because of the 4,000-partition limit? The objective is to maximize query performance and minimize cost.
Detailed Question: I don't want to simply replicate the Oracle table in BigQuery; I want the design to be optimized for BigQuery. So I've been looking into partitioning, sharding, and clustering. I'm also looking into denormalization, but that's a different question.
In our Oracle db, we simply partition by an integer date in YYYYMMDD format. I don't believe we can do this in BigQuery, however, because a table can only have 4,000 partitions. If we partition by day, the table can hold a little less than 11 years of data (4,000 / 365 ≈ 10.96 years), which is well below what we currently need to migrate.
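For context, here is a minimal sketch (table and column names are my own placeholders) of what a straight translation of that layout would look like as a daily date-partitioned BigQuery table; with one partition per calendar day, this is the design that runs into the 4,000-partition cap after roughly 11 years:

```sql
-- Hypothetical fact table partitioned by day, mirroring the Oracle layout.
-- One partition per calendar day, so ~11 years of history exhausts the
-- 4,000-partition limit.
CREATE TABLE `my_project.sales.fact_transactions`
(
  order_date    TIMESTAMP,
  order_date_id INT64,    -- integer date, e.g. 20010315
  site_location STRING,
  amount        NUMERIC
)
PARTITION BY DATE(order_date);
```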
There are certainly other fields we could partition on besides date (e.g. site location), but I believe date is probably the best choice.
Below are the options I'm considering. Let's say the table contains a `datetime` column `order_date` and its integer-date version `order_date_id`:

- Shard by year (i.e. all orders with `order_date` in 2001 go into `my_table_2001`), and partition each yearly table by `order_date`.
- No sharding: keep one big table, create an integer column for the year (`order_year`), and use it as the partitioning column (see the sketch after this list).
- Shard by another column (e.g. site location), then partition by `order_year`.
- Shard by `order_year` and another column (e.g. site location), and partition by `order_date`.
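For illustration, here is a rough sketch of the second option (names and the GENERATE_ARRAY bounds are placeholders): one table, integer-range partitioned on `order_year`, so there is only one partition per year, with optional clustering to help prune within a year:

```sql
-- Option 2 sketch: one big table, integer-range partitioned on order_year.
-- One partition per year stays far below the 4,000-partition limit;
-- clustering on order_date and site_location narrows scans within a year.
CREATE TABLE `my_project.sales.fact_transactions`
(
  order_date    TIMESTAMP,
  order_date_id INT64,
  order_year    INT64,    -- e.g. 2001, derived from order_date
  site_location STRING,
  amount        NUMERIC
)
PARTITION BY RANGE_BUCKET(order_year, GENERATE_ARRAY(1990, 2040, 1))
CLUSTER BY order_date, site_location;
```

Queries would then need an explicit filter on `order_year` (e.g. `WHERE order_year BETWEEN 2001 AND 2005`) to get partition pruning.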
If I'm going to shard tables, I definitely want to use a `datetime` column for partitioning so I can use wildcards to query all the sharded tables at once. I found out that using integer-range partitioning prevents you from using wildcards.
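For example, this is the kind of wildcard query I have in mind against yearly shards (shard and column names assumed), using `_TABLE_SUFFIX` to restrict which shards get scanned:

```sql
-- Hypothetical query across yearly shards my_table_2001 .. my_table_2005.
-- _TABLE_SUFFIX prunes whole shards; the order_date filter prunes
-- partitions within each shard.
SELECT site_location, SUM(amount) AS total_amount
FROM `my_project.sales.my_table_*`
WHERE _TABLE_SUFFIX BETWEEN '2001' AND '2005'
  AND order_date >= TIMESTAMP('2001-01-01')
GROUP BY site_location;
```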
It's also important to note that business users may routinely want to query large date ranges, if not all of the available data.