
Once an Apache Beam pipeline has been designed and tested on Google Cloud Dataflow using the Python SDK and DataflowRunner, what is a convenient way to get it into Google Cloud and manage its execution?

What is a convenient way to deploy and manage the execution of a Python SDK Apache Beam pipeline on Google Cloud Dataflow?

Should it be packaged somehow? Uploaded to Google Cloud Storage? Turned into a Dataflow template? How can one schedule its execution beyond a developer running it from their development environment?

Update

Preferably without third-party tools or the need for additional management tools/infrastructure beyond Google Cloud, and Dataflow in particular.


2 Answers


Intuitively you’d expect the “Deploying a pipeline” section under the How-to guides of the Dataflow documentation to cover this. Instead, the explanation only turns up eight sections further down, in the “Templates overview” section.

According to that section:

Cloud Dataflow templates introduce a new development and execution workflow that differs from traditional job execution workflow. The template workflow separates the development step from the staging and execution steps.

Strictly speaking, you do not have to deploy your Dataflow pipeline to Google Cloud at all in order to execute it. But if you need to share the execution of a pipeline with non-technical members of your cloud project, or simply want to trigger it without depending on a development environment or third-party tools, then Dataflow templates are what you need.

Once a pipeline has been developed and tested, you can create a Dataflow job template from it.

Please note that:

To create templates with the Cloud Dataflow SDK 2.x for Python, you must have version 2.0.0 or higher.

You will need to execute your pipeline with DataflowRunner, using pipeline options that generate a template on Google Cloud Storage rather than running the job, as sketched below.
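For illustration, here is a minimal sketch of what staging a template could look like. The project ID, region, bucket and pipeline steps are placeholders you would replace with your own:

    # Minimal sketch: staging a pipeline as a Dataflow template.
    # Project ID, region, bucket and paths are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',
        region='us-central1',
        temp_location='gs://my-bucket/temp',
        staging_location='gs://my-bucket/staging',
        # Instead of launching a job, the runner stages the template here:
        template_location='gs://my-bucket/templates/my_template',
    )

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
         | 'Split' >> beam.FlatMap(str.split)
         | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/words'))

When this script runs, no Dataflow job is started; the template file is written to the template_location path instead.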

For more details, refer to the creating templates documentation section; to run a pipeline from a template, refer to the executing templates section.
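Once staged, the template can be launched from anywhere that can reach the Dataflow REST API (the Cloud Console and the gcloud CLI work as well, as the executing templates section describes). A rough sketch using the google-api-python-client, with placeholder project, location, path and job name:

    # Rough sketch: launching a job from a staged template via the Dataflow REST API.
    # Requires google-api-python-client and Application Default Credentials;
    # project, location, GCS path and job name are placeholders.
    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')

    request = dataflow.projects().locations().templates().launch(
        projectId='my-project',
        location='us-central1',
        gcsPath='gs://my-bucket/templates/my_template',
        body={'jobName': 'my-templated-job', 'parameters': {}},
    )
    response = request.execute()
    print('Started job:', response['job']['id'])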


I'd say the most convenient way is to use Airflow. It allows you to author, schedule, and monitor workflows, and its Dataflow operator can start your designed data pipeline (see the sketch below). Airflow can run either on a small VM or via Cloud Composer, the managed Airflow service on Google Cloud Platform.
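As a hedged sketch only (it assumes the apache-airflow-providers-google package, and the operator and parameter names can differ between Airflow versions), a DAG that starts a job from a staged template on a daily schedule could look roughly like this; the template path, project and job name are placeholders:

    # Sketch of an Airflow DAG that launches a staged Dataflow template daily.
    # Assumes apache-airflow-providers-google; names and paths are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataflow import (
        DataflowTemplatedJobStartOperator,
    )

    with DAG(
        dag_id='run_dataflow_template',
        start_date=datetime(2024, 1, 1),
        schedule_interval='@daily',
        catchup=False,
    ) as dag:
        start_job = DataflowTemplatedJobStartOperator(
            task_id='start_dataflow_job',
            template='gs://my-bucket/templates/my_template',
            project_id='my-project',
            location='us-central1',
            job_name='my-templated-job',
            parameters={},
        )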

There are more options for automating your workflow, such as Jenkins, Azkaban, Rundeck, or even a simple cron job (which I'd discourage you from using). You might want to take a look at those as well, but Airflow probably fits your needs.