1
votes

I am attempting to deploy a Dataflow job that reads from BigQuery and writes to Cassandra on a fixed schedule. The template code is written in Java using Apache Beam and the Dataflow runner. I have staged the template in Google Cloud Storage, and have configured a Cloud Scheduler job and a Cloud Function to trigger the Dataflow template. I am using the latest versions of all Beam and BigQuery dependencies.

However, I have discovered that every job deployed from the same staged template uses the same BigQuery extract job ID, which causes a 409 Conflict failure shown in the logs. The BigQuery query job succeeds because its job ID has a unique suffix appended, while the extract job ID uses the same prefix but no suffix.

I have considered two alternative solutions: using a crontab on a Compute Engine instance to deploy the pipeline directly, or adapting a Cloud Function to perform the same tasks as the Dataflow pipeline on a schedule. Ideally, if there is a way to change the extract job ID in the Dataflow job, that would be the much easier solution, but I'm not sure if this is possible. If it is not, is there an alternative that is more suitable?

1
It's hard to make any assertions here, as there's not enough detail, but yes, it should be fine not having a hardcoded reference. Is the extract happening as a byproduct of some particular usage of BigQueryIO for reading, or is the pipeline defining it explicitly, etc.? - shollyman
The extract is happening as a result of a BigQueryIO.readTableRows() call in a pipeline step. A query job runs right before the extract job and writes its results to a temporary table; the extract job then exports those results in TableRow format. This occurs right before a ParDo function that transforms the TableRows, but the ParDo step is never executed because the read step fails. - eagerbeaver

1 Answer

3
votes

Based on the additional description, it sounds like this may be a case of not using withTemplateCompatibility() as directed?

Usage with templates

When using read() or readTableRows() in a template, it's required to specify BigQueryIO.Read.withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
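In other words, adding withTemplateCompatibility() to the read makes the staged template re-executable: each launch generates fresh BigQuery job IDs instead of reusing the ones fixed at template staging time. A minimal sketch of what the read might look like (the query string and step name here are placeholders, not taken from your pipeline):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class TemplateReadExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    PCollection<TableRow> rows =
        p.apply(
            "ReadFromBigQuery",
            BigQueryIO.readTableRows()
                // Required when the pipeline is staged as a template:
                // defers BigQuery job creation to launch time, so each
                // run gets unique query/extract job IDs and avoids the
                // 409 duplicate-job-ID conflict.
                .withTemplateCompatibility()
                .fromQuery("SELECT * FROM `project.dataset.table`") // placeholder
                .usingStandardSql());

    // ... ParDo transforms and the Cassandra write would follow here.

    p.run();
  }
}
```

After re-staging the template with this change, each scheduled launch should create its own extract job rather than colliding with the one recorded at staging time.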