1
votes

I have several tables that are populated by a GCP DataFlow streaming app as part of some data pipelines (the fact that it is DataFlow is not that relevant here apart from the fact that it is being populated on a semi-regular basis in stream mode). These tables are used by downstream processes that depend on the name of the table.

I need to evolve the schema of these tables in a productionised way. Following BQ documentation advice (https://cloud.google.com/bigquery/docs/manually-changing-schemas#option_2_exporting_your_data_and_loading_it_into_a_new_table), I intend to export the current table in AVRO format to GCS and then create a new table* based on the new backwards-compatible schema and the finally load the AVRO export into the new table before then overwriting the original table.

* The reason I create a new table rather than writing over the same table is because I need to make sure this operation succeeds across several projects that I am coordinating this schema evolution across before I "update" the actual table. In any case, I believe I'd have the same problem if I tried to update the table in place.

The Problem

The problem is that between my export starting and the load finishing, my DataFlow app could have updated the original table (it works in an INSERT / OVERWRITE PARTITION fashion). This is then a problem because whilst I process the schema change, I will lose this data.

How can I safely update my table schema without batch transactions / distributed transactions / table locks? As mentioned in the above * block, I have the additional complexity of needing to use an intermittent table to ensure my operation will work across all projects before I process it into the table that downstream stuff is depending on.

The only option I can think of is custom-implement the behaviour that I would get through a lock - but through co-operation. I.e. my schema update process can send a message to DataFlow to tell it to hold off until its finished its thing.

1
Do you want to add fields or to update/remove existing fields? Can you stop your Dataflow for some minutes? - guillaume blaquiere
I don't want to have to analyse exactly what change is needed; at this stage in the pipeline, we have confidence that the new AVRO schema is backwards compatible but it could be n combinations of changes, deletions, default policy changes, new, relaxed modes etc. etc. I know that a couple of these changes can be done without recreating the table (we'll do that in those instances) but want a generic strategy for all others. "stopping" the dataflow app is essentially what the co-operative lock idea is that I described above - certainly an option. - Adam Cunnington
If you only add field, you can simply update the schema in BigQuery. The schema update is problematic when you update field or want to remove some. (that's why my question). So, the response is very dependent of your data management. You speak about partition. Are you writing only in the partition of the day or yesterday? If yes, you can safely export and import all the unchanged partition, and then you only have to perform a small diff, maybe to perform with a query (the cost will be low -> Solution 1 in the migration doc). If no, a specific process is to implement. - guillaume blaquiere

1 Answers

0
votes

I take this question as "How to safely evolve table schema while a table has streaming updates".

Components affected by schema change:

  1. DataFlow streaming APP (or Streaming source in general)
  2. BigQuery table (streaming target)
  3. User of the table

I suggest to do below, in order to migrate schema from V1 to V2

Components

  1. Versioned DataFlow app
  2. Versioned BigQuery table
  3. (Optional) To minimize impact of user, a BigQuery view to hide versioned table detail

Migration Steps

  1. (Pre-step) Having a MERGE command ready/tested which merges data from V1 table to V2 table.
  2. Switch DataFlow app to V2 and write to a new V2 table.
  3. Run MERGE command to finish data migration from V1 table to V2 table.
  4. (If a view is used) Update the view to point to V2 table.
  5. Delete V1 table after verification is done.

Migration Steps Considering Rollback

  1. (Pre-step) Having a MERGE command ready/tested which merges data from V1 table to V2 table.
  2. Having DataFlow app V1 writing to V1 table and V2 write to a new V2 table, simultaneously.
  3. Run MERGE command to finish data migration from V1 table to V2 table.
  4. Verification: good, stop DataFlow app V1 and delete table V1.
  5. Verification: bad: stop DataFlow app V2 and delete table V2.
  6. (If a view is used) Update the view to point to V2 table.