I have several tables that are populated by a GCP DataFlow streaming app as part of some data pipelines (the fact that it is DataFlow is not that relevant here apart from the fact that it is being populated on a semi-regular basis in stream mode). These tables are used by downstream processes that depend on the name of the table.
I need to evolve the schema of these tables in a productionised way. Following the BQ documentation's advice (https://cloud.google.com/bigquery/docs/manually-changing-schemas#option_2_exporting_your_data_and_loading_it_into_a_new_table), I intend to export the current table to GCS in Avro format, create a new table* with the new backwards-compatible schema, load the Avro export into that new table, and finally overwrite the original table with it.
* The reason I create a new table rather than writing over the same table is that I need to make sure this operation succeeds in every project I am coordinating this schema evolution across before I "update" the actual table. In any case, I believe I'd have the same problem if I tried to update the table in place.
The Problem
The problem is that between my export starting and the load finishing, my DataFlow app could have updated the original table (it works in an INSERT / OVERWRITE PARTITION fashion). Anything written in that window is not in the export, so I will lose that data when I overwrite the original table with the new one.
How can I safely update my table schema without batch transactions / distributed transactions / table locks? As mentioned in the footnote (*) above, I have the additional complexity of needing to use an intermediate table to ensure my operation will work across all projects before I apply it to the table that downstream processes depend on.
The only option I can think of is to custom-implement the behaviour that I would get from a lock, but through co-operation. I.e. my schema update process can send a message to DataFlow to tell it to hold off until the schema update has finished.
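
To make the idea concrete, here is a rough Python sketch of the kind of co-operative (advisory) lock I have in mind. Everything in it is hypothetical and only for illustration: `InMemoryFlagStore` stands in for some shared location that both the schema update process and the DataFlow job could read, and `SchemaChangeLock`, `CooperativeWriter` and the TTL are names and choices I made up, not anything BigQuery or DataFlow provides.

```python
import time
from typing import Callable, Dict, List, Optional


class InMemoryFlagStore:
    """Hypothetical stand-in for a store shared by both processes."""

    def __init__(self) -> None:
        self._flags: Dict[str, float] = {}

    def set(self, key: str, expires_at: float) -> None:
        self._flags[key] = expires_at

    def get(self, key: str) -> Optional[float]:
        return self._flags.get(key)

    def clear(self, key: str) -> None:
        self._flags.pop(key, None)


def writer_may_write(store: InMemoryFlagStore, table: str,
                     clock: Callable[[], float] = time.time) -> bool:
    """True if no unexpired 'hold off' flag exists for the table."""
    expires_at = store.get(table)
    return expires_at is None or clock() >= expires_at


class SchemaChangeLock:
    """Advisory lock: raises a flag on enter, clears it on exit.

    The TTL is an assumption so that a crashed schema update
    cannot block the writer forever.
    """

    def __init__(self, store: InMemoryFlagStore, table: str,
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.store, self.table = store, table
        self.ttl_seconds, self.clock = ttl_seconds, clock

    def __enter__(self) -> "SchemaChangeLock":
        self.store.set(self.table, self.clock() + self.ttl_seconds)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.store.clear(self.table)
        return False  # never swallow exceptions from the schema update


class CooperativeWriter:
    """Writer side: buffers rows while the flag is raised."""

    def __init__(self, store: InMemoryFlagStore, table: str,
                 sink: Callable[[dict], None]) -> None:
        self.store, self.table, self.sink = store, table, sink
        self.pending: List[dict] = []

    def write(self, row: dict) -> None:
        self.pending.append(row)
        self.flush()

    def flush(self) -> int:
        """Write buffered rows if allowed; return how many were written."""
        if not writer_may_write(self.store, self.table):
            return 0
        written = len(self.pending)
        for row in self.pending:
            self.sink(row)
        self.pending.clear()
        return written
```

The schema update would then wrap the export / load / overwrite steps in `with SchemaChangeLock(store, "my_table"):`, and the writer would call `flush()` again once the flag is gone. Note that this sketch does not cover a write that is already in flight when the flag is raised, so I assume the schema update process would also need some way to wait for the writer to acknowledge that it has paused.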