2
votes

Wanted some advice on how to deal with table operations (rename column) in Google BigQuery.

Currently, I have a wrapper to do this. My tables are partitioned by date, e.g. for a table named fact, I will have several tables named:

fact_20160301

fact_20160302

fact_20160303... etc

My rename-column wrapper generates aliased queries. For example, if I want to change my table schema from ['address', 'name', 'city'] to ['location', 'firstname', 'town'],

I run a batch query: select address as location, name as firstname, city as town

and write the result back with WRITE_TRUNCATE on the parent tables.
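A minimal sketch of the query-generation half of such a wrapper (dataset and table names here are made up; the backtick quoting assumes standard SQL, legacy SQL would use [dataset.table]):

```python
def rename_query(dataset, table, mapping):
    """Build the aliased SELECT that renames columns per `mapping` (old -> new)."""
    select_list = ", ".join(f"{old} AS {new}" for old, new in mapping.items())
    return f"SELECT {select_list} FROM `{dataset}.{table}`"

mapping = {"address": "location", "name": "firstname", "city": "town"}
sql = rename_query("mydataset", "fact_20160301", mapping)
# The query is then submitted as a batch-priority job with
# writeDisposition=WRITE_TRUNCATE and the same table as the destination.
```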

My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 slots for ETL jobs that are running.

Also, I haven't found a way to poll a batch operation to see whether or not all jobs in the batch have completed.
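One workaround is to keep your own list of submitted jobs and poll each until every one reports done. A sketch, assuming job objects that expose a done() check (the FakeJob class is a stand-in for a real BigQuery job, purely for illustration):

```python
import time

def wait_for_batch(jobs, poll_interval=5.0):
    """Block until every job in `jobs` has completed, polling periodically."""
    pending = list(jobs)
    while pending:
        pending = [job for job in pending if not job.done()]
        if pending:
            time.sleep(poll_interval)

class FakeJob:
    """Stand-in for a real job: reports done after a few polls."""
    def __init__(self, polls_until_done):
        self._remaining = polls_until_done
    def done(self):
        self._remaining -= 1
        return self._remaining <= 0

wait_for_batch([FakeJob(1), FakeJob(3)], poll_interval=0)
```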

If anyone has some tips or tricks, I'd love to hear them.

2
What is the use case behind this select with "rename" and write_truncate? It doesn't sound very practical. Understanding your case would help to answer your question. – Mikhail Berlyant
We are replacing Vertica with BigQuery. Our team gets requests for renaming columns at times, so we rename columns across the board so that it doesn't break any upstream or downstream jobs when they query that particular column. – Jeff Mao
Have you considered using views? It is zero cost vs. executing a kind of 'select into' where you pay for a whole table scan. Still some coding though :) – Mikhail Berlyant
Thanks for the suggestion, but I don't think views would help. Btw, I noticed you are the author of BQ-mate. Great extension :). Everyone on our team who works on BQ has it installed. – Jeff Mao
That's great! Glad it is being used widely! Thank you! – Mikhail Berlyant

2 Answers

3
votes

I can propose two options:

Using View

View creation is very simple to script out and execute - it is fast, and free compared with the cost of scanning the whole table with the select-into approach.
You can create a view using the Tables: insert API by populating the view property (the table's type will then be VIEW)
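A sketch of the request body for Tables: insert (the project, dataset, and table names are placeholders; the query here uses legacy SQL table syntax):

```python
# Request body for the v2 Tables: insert endpoint. The presence of the
# `view` field is what makes BigQuery create a VIEW rather than a table.
view_body = {
    "tableReference": {
        "projectId": "my-project",
        "datasetId": "mydataset",
        "tableId": "fact_renamed_20160301",
    },
    "view": {
        "query": "SELECT address AS location, name AS firstname, city AS town "
                 "FROM [mydataset.fact_20160301]",
    },
}
# POST view_body to:
# https://www.googleapis.com/bigquery/v2/projects/my-project/datasets/mydataset/tables
```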

Using Jobs: insert EXTRACT and then LOAD

Here you can extract the table to GCS and then load it back into BigQuery with an adjusted schema
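Sketched as two Jobs: insert configurations (bucket and table names are placeholders). CSV is used so the load can map columns positionally onto the renamed schema; a JSON export would carry the old field names as keys:

```python
# Job configuration for the EXTRACT step: dump the partition to GCS as CSV.
extract_config = {
    "extract": {
        "sourceTable": {"projectId": "my-project",
                        "datasetId": "mydataset",
                        "tableId": "fact_20160301"},
        "destinationUris": ["gs://my-bucket/fact_20160301/*.csv"],
        "destinationFormat": "CSV",
    }
}

# Job configuration for the LOAD step: read the CSV back, truncating the
# table and applying the schema with the new column names (matched by position).
load_config = {
    "load": {
        "sourceUris": ["gs://my-bucket/fact_20160301/*.csv"],
        "destinationTable": {"projectId": "my-project",
                             "datasetId": "mydataset",
                             "tableId": "fact_20160301"},
        "sourceFormat": "CSV",
        "writeDisposition": "WRITE_TRUNCATE",
        "schema": {"fields": [
            {"name": "location", "type": "STRING"},
            {"name": "firstname", "type": "STRING"},
            {"name": "town", "type": "STRING"},
        ]},
    }
}
```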

The above approach will a) eliminate the cost of querying (scanning) the tables and b) help with the concurrency limits. But it might not, depending on the actual volume of your tables and other requirements you might have

0
votes

The best way to manipulate a schema is through the Google BigQuery API.

  1. Use the tables.get API to retrieve the existing schema for your table: https://cloud.google.com/bigquery/docs/reference/v2/tables/get
  2. Manipulate the schema, renaming columns etc.
  3. Again using the API, perform an update on the table, setting the schema to your newly modified version. This should all occur in one job: https://cloud.google.com/bigquery/docs/reference/v2/tables/update
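The schema rewrite in step 2 can be a simple mapping over the fields returned by tables.get; a sketch under the assumption that the update endpoint accepts the change (BigQuery restricts which schema modifications an update will allow, so verify that a rename is actually accepted before relying on this):

```python
def rename_schema_fields(schema, mapping):
    """Return a copy of a tables.get `schema` with columns renamed per `mapping` (old -> new)."""
    return {
        "fields": [
            dict(field, name=mapping.get(field["name"], field["name"]))
            for field in schema["fields"]
        ]
    }

# Example schema as returned by tables.get (simplified).
old_schema = {"fields": [
    {"name": "address", "type": "STRING"},
    {"name": "name", "type": "STRING"},
    {"name": "city", "type": "STRING"},
]}
new_schema = rename_schema_fields(
    old_schema, {"address": "location", "name": "firstname", "city": "town"}
)
# new_schema is then sent back in the body of the tables.update request.
```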