I have an Apache Beam pipeline in Python that, for whatever reason, has a flow like the one below.
from google.cloud import bigquery
import apache_beam as beam

client = bigquery.Client()

# First SQL job
query_job1 = client.query('create table sample_table_1 as select * from table_1')
result1 = query_job1.result()

# Beam pipeline
with beam.Pipeline(options=options) as p:
    records = (
        p
        | 'Data pull' >> beam.io.Read(beam.io.BigQuerySource(...))
        | 'Transform' >> ...
        | 'Write to BQ' >> beam.io.WriteToBigQuery(...)
    )

# Second SQL job
query_job2 = client.query('create table sample_table_2 as select * from table_2')
result2 = query_job2.result()
In short, the flow is: SQL Job --> Beam data pipeline --> SQL Job.
This sequence works fine when I run it locally. However, when I try to run it as a Dataflow pipeline, the steps don't actually execute in this order.
Is there a way to force these dependencies when running on Dataflow?
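
One way to make the intended ordering explicit (a sketch only, assuming it is acceptable for the launching process to block until the Dataflow job finishes; the query, project, table, and schema names are placeholders, and options is the same PipelineOptions object used above) is to run the pipeline without the context manager and wait for it before issuing the second query:

from google.cloud import bigquery
import apache_beam as beam

client = bigquery.Client()
# First SQL job must finish before the pipeline starts.
client.query('create table sample_table_1 as select * from table_1').result()

# Build the pipeline outside the context manager so the launch is explicit.
p = beam.Pipeline(options=options)
records = (
    p
    | 'Data pull' >> beam.io.Read(beam.io.BigQuerySource(query='select * from sample_table_1'))  # placeholder query
    | 'Write to BQ' >> beam.io.WriteToBigQuery('my-project:my_dataset.output_table',              # placeholder table
                                               schema='id:INTEGER,name:STRING')                   # placeholder schema
)

# wait_until_finish() blocks until the Dataflow job completes,
# so the second SQL job only starts after the write has happened.
p.run().wait_until_finish()

client.query('create table sample_table_2 as select * from table_2').result()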
You can set create_disposition=CREATE_IF_NEEDED to let Dataflow create a new table before writing to it. If that doesn't work for you, it would help to know what query_job1 and query_job2 are for. beam.apache.org/releases/pydoc/2.23.0/… – Peter Kim
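
As a concrete illustration of the comment above (a sketch only; the destination table and schema are hypothetical), the dispositions are passed directly to the existing write step:

import apache_beam as beam

# Hypothetical destination table and schema, for illustration only.
write_to_bq = beam.io.WriteToBigQuery(
    'my-project:my_dataset.sample_output',
    schema='id:INTEGER,name:STRING',
    # Create the table if it does not already exist.
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    # Append rows instead of truncating any existing data.
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)

This lets the pipeline create its own output table as part of the write, but it does not by itself control the ordering of the surrounding SQL jobs.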