I am trying to create a Dataflow template and run it via the Dataflow Cloud UI. After executing the pipeline via the command line with the Dataflow runner, it works correctly (i.e. the right data appears in the right places), but no "pre-compiled" template/staging files appear in the Google Cloud Storage bucket.

I did see this, but the post never mentions a resolution, and I did include the parameter mentioned therein.

My command to run is:

    python apache_beam_test.py --runner DataflowRunner --project prototypes-project --staging_location gs://dataflow_templates/staging --temp_location gs://dataflow_templates/temp --template_location gs://dataflow_templates/

I do get warnings regarding the options, however:

    C:\Python38\lib\site-packages\apache_beam\io\gcp\bigquery.py:1677: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
      experiments = p.options.view_as(DebugOptions).experiments or []

    C:\Python38\lib\site-packages\apache_beam\io\gcp\bigquery_file_loads.py:900: BeamDeprecationWarning: options is deprecated since First stable release. References to <pipeline>.options will not be supported
      temp_location = p.options.view_as(GoogleCloudOptions).temp_location

Does that mean my command-line arguments are not being interpreted, and if so, how do I get the Dataflow/Beam templates into my GCS bucket so I can reference them from the Dataflow UI and run them again later?

Help much appreciated!

1 Answer

The problem was indeed that the CLI flags needed to be explicitly passed into the pipeline options.

As I had not added any custom flags to my project, I wrongly assumed Beam would pick up the standard flags automatically, but that was not the case.
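
In code, the fix looks roughly like this. This is only a minimal sketch: the argument parser and the placeholder transforms stand in for my actual apache_beam_test.py, and the only point is where the command-line flags go.

    import argparse

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def run(argv=None):
        parser = argparse.ArgumentParser()
        # Any custom/runtime parameters would be declared on the parser here;
        # I had none, which is why I thought this step could be skipped.
        _, pipeline_args = parser.parse_known_args(argv)

        # The crucial part: hand the remaining command-line flags
        # (--runner, --project, --staging_location, --temp_location,
        # --template_location, ...) to PipelineOptions and pass them
        # to the Pipeline explicitly.
        pipeline_options = PipelineOptions(pipeline_args)

        with beam.Pipeline(options=pipeline_options) as p:
            # ... the actual transforms of the pipeline go here ...
            p | "Create" >> beam.Create(["placeholder"]) | "Print" >> beam.Map(print)


    if __name__ == "__main__":
        run()

As far as I can tell, beam.Pipeline() created without any options falls back to an empty option set, so none of the flags (including --template_location) ever reach the runner, which is why no template file was written.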

Basically, you have to follow this even if you have no new parameters to add.

I assumed that step was optional (which it technically is if you only want to execute a pipeline once, without any runtime parameters), but in order to reuse and monitor the pipelines in the Dataflow UI you have to stage them first, and that in turn requires passing a staging location into the pipeline.

Also, as far as I understand, execution of the pipeline requires a service account, while uploading the staging files requires Google Cloud SDK authentication.
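
For reference, on my Windows machine that meant authenticating the SDK before staging and making a service-account key available for the client libraries, roughly like this (the key path is a placeholder, and this reflects my own understanding rather than the official docs):

    gcloud auth login
    set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account-key.json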