We're in the midst of migrating our on-prem Oracle db to the cloud. The biggest project is to move our fact table that tracks customer transactions.
Short Question: What is the best way to shard/partition a fact table in BigQuery when you can't use a date field for partitioning because of the 4,000-partition limit? The objective is to maximize query performance and minimize cost.
Detailed Question: I don't want to simply replicate the Oracle table in BigQuery; I want the design to be optimized for BigQuery. So I've been looking into partitioning, sharding, and clustering. I'm also looking into denormalization, but that's a different question.
In our Oracle db, we simply partition by an integer date in YYYYMMDD format. I don't believe we can do this in BigQuery, however, because a table can only have 4,000 partitions. If we partition by day, the table can hold a little less than 11 years of data (4,000 / 365 ≈ 10.96 years), which is well below what we currently need to migrate.
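For context, here is a minimal sketch (table and column names are my own placeholders) of what a straight translation of that layout would look like as a daily date-partitioned BigQuery table; with one partition per calendar day, this is the design that runs into the 4,000-partition cap after roughly 11 years:

```sql
-- Hypothetical fact table partitioned by day, mirroring the Oracle layout.
-- One partition per calendar day, so ~11 years of history exhausts the
-- 4,000-partition limit.
CREATE TABLE `my_project.sales.fact_transactions`
(
  order_date    TIMESTAMP,
  order_date_id INT64,    -- integer date, e.g. 20010315
  site_location STRING,
  amount        NUMERIC
)
PARTITION BY DATE(order_date);
```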
There are certainly other fields we could partition on besides date (e.g. site location), but I believe date is probably the best choice.
Below are the options I'm considering. Let's say the table contains a `datetime` column `order_date` and its integer-date version `order_date_id`:

- Shard by year (i.e. all orders with `order_date` in 2001 go into `my_table_2001`), and partition each yearly table by `order_date`.
- No sharding: keep one big table, create an integer column for the year (`order_year`), and use it as the partitioning column (see the sketch after this list).
- Shard by another column (e.g. site location), then partition by `order_year`.
- Shard by `order_year` and another column (e.g. site location), and partition by `order_date`.
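For illustration, here is a rough sketch of the second option (names and the GENERATE_ARRAY bounds are placeholders): one table, integer-range partitioned on `order_year`, so there is only one partition per year, with optional clustering to help prune within a year:

```sql
-- Option 2 sketch: one big table, integer-range partitioned on order_year.
-- One partition per year stays far below the 4,000-partition limit;
-- clustering on order_date and site_location narrows scans within a year.
CREATE TABLE `my_project.sales.fact_transactions`
(
  order_date    TIMESTAMP,
  order_date_id INT64,
  order_year    INT64,    -- e.g. 2001, derived from order_date
  site_location STRING,
  amount        NUMERIC
)
PARTITION BY RANGE_BUCKET(order_year, GENERATE_ARRAY(1990, 2040, 1))
CLUSTER BY order_date, site_location;
```

Queries would then need an explicit filter on `order_year` (e.g. `WHERE order_year BETWEEN 2001 AND 2005`) to get partition pruning.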
If I'm going to shard tables, I definitely want to use a `datetime` column for partitioning so I can use wildcards to query all the sharded tables at once. I found out that using integer-range partitioning prevents you from using wildcards.
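For example, this is the kind of wildcard query I have in mind against yearly shards (shard and column names assumed), using `_TABLE_SUFFIX` to restrict which shards get scanned:

```sql
-- Hypothetical query across yearly shards my_table_2001 .. my_table_2005.
-- _TABLE_SUFFIX prunes whole shards; the order_date filter prunes
-- partitions within each shard.
SELECT site_location, SUM(amount) AS total_amount
FROM `my_project.sales.my_table_*`
WHERE _TABLE_SUFFIX BETWEEN '2001' AND '2005'
  AND order_date >= TIMESTAMP('2001-01-01')
GROUP BY site_location;
```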
It's also important to note that business users may routinely want to query large date ranges, if not all of the available data.