I'm getting daily exports of Google Analytics data into BigQuery and these form the basis for our main reporting dataset.

Over time I need to add new columns for additional things we use to enrich the data - for example, a mapping from URL to 'reporting category'.

This is easy to add as a new column on the processed tables (there are about 10 processing steps at the moment for all the enrichment we do).

The issue is when stakeholders then ask: can we add that new column to the historical data?

Currently I then need to rerun all the daily jobs, which is very slow and costly.

This is coming up frequently enough that I'm seriously thinking about redesigning my data pipelines around the fact that from time to time I essentially need to drop and recreate ALL the data when I add a new field or correct old dirty data.

I'm just wondering if there are better ways to:

  1. Add a new column to an old table in BQ (I'd be happy to do this by hand in the instances where I can just join the new column on the ga [hit_key] I have defined, which is basically a row key) - see the sketch after this list.
  2. (Less common) Update existing tables based on some WHERE condition.
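
For case 1, something like this sketch is what I'd hope to be able to run by hand - the table names here are made up ([my_dataset.ga_data_final_20170101] as one daily processed table, [my_dataset.url_category_map] as the new enrichment lookup), and I'm not sure this is the right approach in BQ:

    -- Standard SQL. Add the new column; existing rows get NULL for it.
    ALTER TABLE `my_dataset.ga_data_final_20170101`
    ADD COLUMN reporting_category STRING;

    -- Backfill the new column by joining the mapping table on hit_key.
    UPDATE `my_dataset.ga_data_final_20170101` t
    SET reporting_category = m.reporting_category
    FROM `my_dataset.url_category_map` m
    WHERE t.hit_key = m.hit_key;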

Just wondering what the best practices are, whether anyone has had similar issues where you basically need to update a historical schema, and whether there are ways to do it without just dropping and recreating, which is essentially what I'm currently doing.

To be clearer on my current approach: I'm taking the [ga_sessions_yyyymmdd] table and making a series of [ga_data_prepN_yyyymmdd] tables where I either add new columns at each step or reduce the data in some way. There are now 11 of these steps, and each time I'm taking all 100 or more columns along for the ride. This is what I'm going to try to design away from, as currently 90% of the columns at each stage don't even need to be touched - they could just be joined back on at the end based on [hit_key] or something.

It's a little bit messy though to try and pick apart.
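
Roughly the shape I'm aiming for instead - all table and column names below are made up for illustration, with [ga_data_keyed_yyyymmdd] standing in for the export once [hit_key] has been added:

    -- Each enrichment step writes a narrow table: hit_key plus only the
    -- columns that step actually derives.
    CREATE TABLE `my_dataset.enrich_category_20170101` AS
    SELECT
      hit_key,
      CASE
        WHEN page_path LIKE '/blog/%' THEN 'content'
        ELSE 'other'
      END AS reporting_category
    FROM `my_dataset.ga_data_keyed_20170101`;

    -- A single final step joins the narrow enrichment tables back onto
    -- the full data, so the 100+ untouched columns are only carried once.
    CREATE TABLE `my_dataset.ga_data_final_20170101` AS
    SELECT s.*, c.reporting_category
    FROM `my_dataset.ga_data_keyed_20170101` s
    LEFT JOIN `my_dataset.enrich_category_20170101` c
      ON s.hit_key = c.hit_key;

The idea being that to recreate history for a new column, only the one cheap enrichment step and the final join would need rerunning, not all 11 steps.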

1 Answer

Adding new columns to the schema of existing historical tables is possible, but the values for the newly added columns will be NULL. If you do need to populate values into these columns, probably the best approach is to use an UPDATE DML statement. More details on how to try it out are here: Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?
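
For example, an update based on a WHERE condition (case 2 in your question) could look something like this - table and column names are placeholders only:

    -- Standard SQL DML; UPDATE requires a WHERE clause in BigQuery.
    UPDATE `my_dataset.ga_data_final_20170101`
    SET reporting_category = 'content'
    WHERE reporting_category IS NULL
      AND page_path LIKE '/blog/%';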