0 votes

I have a set of historical data tables with different but 'compatible' schemas. What I mean is that over time the schema has evolved and more fields got added (none removed). I'd like to run a query over the union of all of them.

Doing so using legacy SQL requires multiple SELECT statements joined by UNION, with dummy NULL placeholders for new fields that did not exist in the earlier tables. I have 100 of those combined queries to build and execute. An illustration follows below.
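To illustrate the pattern (all table and field names here are made up), each combined query ends up looking roughly like this in standard SQL terms, with a typed NULL placeholder for every field the older table lacks:

```python
from google.cloud import bigquery

client = bigquery.Client()
# events_2016 predates `new_field`, so it needs a CAST(NULL ...) placeholder.
sql = """
SELECT user_id, event_ts, CAST(NULL AS STRING) AS new_field
FROM `my_project.my_dataset.events_2016`
UNION ALL
SELECT user_id, event_ts, new_field
FROM `my_project.my_dataset.events_2017`
"""
rows = client.query(sql).result()
```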

I could script this, but I was hoping to use Standard SQL wildcard tables instead. However, I read at https://cloud.google.com/bigquery/docs/reference/legacy-sql#tablewildcardfunctions :

In order to execute a standard SQL query that uses a wildcard table, BigQuery automatically infers the schema for that table. BigQuery uses the schema for the most recently created table that matches the wildcard as the schema for the wildcard table. If the schema is inconsistent across tables matched by the wildcard table, BigQuery returns an error.

I'm not sure how strictly 'inconsistent' is meant here. Would it support an incremental schema as in the use case above, or do all tables need to share exactly the same schema?
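For context, the kind of wildcard query I was hoping to run looks something like this (project, dataset and table names are made up), executed from Python:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical sharded tables: events_2016, events_2017, events_2018, ...
sql = """
SELECT *
FROM `my_project.my_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '2016' AND '2018'
"""
rows = client.query(sql).result()
```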

If that's not possible, another alternative would be to (re)load from source into a new table using the schemaUpdateOptions ALLOW_FIELD_ADDITION, but that means redesign and rework.
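For completeness, that reload would be a load job along these lines (the bucket path and destination table are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Allow later loads to add fields that earlier loads did not have.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",          # hypothetical source files
    "my_project.my_dataset.events_all",      # hypothetical consolidated table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```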

Any other suggestion is welcome. FYI, I use Airflow to execute the queries, so I'd rather use a Pythonic solution.

1
Have you tried running the wildcard table query? Do you get an error? – Elliott Brossard

1 Answer

0 votes

The BigQuery API has table patch functionality (tables.patch), which lets you patch/extend all your old tables with the schema changes introduced by the newer structure.

By patching the old tables you just add new columns, which read as NULL for existing rows; no data is deleted.

So what you need to do is write a script that takes the most recent schema and incrementally patches all previous tables up to that schema; then you can run your UNION (or wildcard) queries. See the sketch below.
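A minimal sketch of such a script using the google-cloud-bigquery Python client (the dataset name and table prefix are placeholders, and newly added fields must be NULLABLE or REPEATED for the patch to succeed):

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = "my_project.my_dataset"  # hypothetical dataset

# Collect the sharded tables, e.g. events_2016, events_2017, ...
# Sorting by name assumes date-style suffixes sort chronologically.
table_ids = sorted(
    t.table_id
    for t in client.list_tables(dataset_id)
    if t.table_id.startswith("events_")  # hypothetical prefix
)

# Use the most recently created table's schema as the target schema.
latest = client.get_table(f"{dataset_id}.{table_ids[-1]}")
target_fields = {field.name: field for field in latest.schema}

for table_id in table_ids[:-1]:
    table = client.get_table(f"{dataset_id}.{table_id}")
    existing = {field.name for field in table.schema}
    # Fields the newest schema has but this table lacks; additions must be
    # NULLABLE (or REPEATED), which matches the incremental-schema use case.
    missing = [f for name, f in target_fields.items() if name not in existing]
    if missing:
        table.schema = list(table.schema) + missing
        client.update_table(table, ["schema"])  # issues a tables.patch call
```

After all shards share the newest schema, the wildcard query from the question should no longer hit the inconsistent-schema error.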

Related: How do I use BigQuery patch?