1
votes

I am attempting to deploy a Dataflow job that reads from BigQuery and writes to Cassandra on a fixed schedule. The template code is written in Java using Apache Beam and the Dataflow runner. I have staged the template in Google Cloud Storage, and have configured a Cloud Scheduler job and a Cloud Function to trigger the Dataflow template. I am using the latest versions of all Beam and BigQuery dependencies.

However, I have discovered that every job deployed from the same staged template uses the same BigQuery extract job ID, which causes a 409 Conflict failure shown in the logs. The BigQuery query job succeeds because its job ID has a unique suffix appended, while the extract job ID uses the same prefix but no suffix.

I have considered two alternative solutions: using a crontab on a Compute Engine instance to deploy the pipeline directly, or adapting a Cloud Function to perform the same tasks as the Dataflow pipeline on a schedule. Ideally, if there is a way to change the extract job ID in the Dataflow job, that would be the much easier solution, but I'm not sure if this is possible. If it is not, is there an alternative that is more suitable?

1
It's hard to make any assertions here, as there's not enough detail, but yes, it should be fine not having a hardcoded reference. Is the extract happening as a byproduct of some particular usage of BigQueryIO for reading, or is the pipeline defining it explicitly, etc.? - shollyman
The extract is happening as a result of a BigQueryIO.readTableRows() call in a pipeline step. A query job runs right before the extract job and writes its results to a temporary table; the extract job then exports those results in TableRow format. This occurs right before a ParDo function that transforms the TableRows, but the ParDo step is never executed because the read step fails. - eagerbeaver

1 Answer

3
votes

Based on the additional description, it sounds like this may be a case of not using withTemplateCompatibility() as directed?

Usage with templates

When using read() or readTableRows() in a template, it's required to specify BigQueryIO.Read.withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
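In other words, adding withTemplateCompatibility() to the read makes the staged template re-executable: each launch generates fresh BigQuery job IDs instead of reusing the ones fixed at template staging time. A minimal sketch of what the read might look like (the query string and step name here are placeholders, not taken from your pipeline):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class TemplateReadExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    PCollection<TableRow> rows =
        p.apply(
            "ReadFromBigQuery",
            BigQueryIO.readTableRows()
                // Required when the pipeline is staged as a template:
                // defers BigQuery job creation to launch time, so each
                // run gets unique query/extract job IDs and avoids the
                // 409 duplicate-job-ID conflict.
                .withTemplateCompatibility()
                .fromQuery("SELECT * FROM `project.dataset.table`") // placeholder
                .usingStandardSql());

    // ... ParDo transforms and the Cassandra write would follow here.

    p.run();
  }
}
```

After re-staging the template with this change, each scheduled launch should create its own extract job rather than colliding with the one recorded at staging time.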