4
votes

I want to run a pipeline on Google Dataflow that depends on the output of another pipeline. Right now I am just running the two pipelines one after the other with the DirectRunner locally:

import apache_beam as beam

# First pipeline: read the raw input, apply the transform, and write
# intermediate results to temporary files.
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText(known_args.input)
     | SomeTransform()
     | beam.io.WriteToText('temp'))

# Second pipeline: read the intermediate files back in (WriteToText
# produces sharded files, hence the 'temp*' glob) and write the final output.
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText('temp*')
     | AnotherTransform()
     | beam.io.WriteToText(known_args.output))

My questions are the following:

  • Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?
  • Is there a preferred way to run two pipelines consecutively?
  • Also, is there a recommended way to separate those pipelines into different files in order to test them better?

1 Answer

3
votes

Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?

No, Dataflow simply executes a pipeline. It has no features for managing dependent pipeline executions.

Update: For clarification, Apache Beam does provide a mechanism for waiting until a pipeline completes. See the waitUntilFinish() method of the PipelineResult class (wait_until_finish() in the Python SDK). Reference: PipelineResult.waitUntilFinish().
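For example, here is a minimal sketch in the Python SDK, assuming the same pipeline_options, known_args, and transforms as in the question: run the first pipeline explicitly rather than through the context manager, block on wait_until_finish(), and only then build and run the second pipeline.

import apache_beam as beam

# Run the first pipeline and block until it finishes.
p1 = beam.Pipeline(options=pipeline_options)
(p1
 | beam.io.ReadFromText(known_args.input)
 | SomeTransform()
 | beam.io.WriteToText('temp'))
result = p1.run()
result.wait_until_finish()

# Only now build and run the second pipeline, which reads the
# intermediate output of the first.
p2 = beam.Pipeline(options=pipeline_options)
(p2
 | beam.io.ReadFromText('temp*')
 | AnotherTransform()
 | beam.io.WriteToText(known_args.output))
p2.run().wait_until_finish()

Note that the `with beam.Pipeline(...)` form in the question already waits for completion when the block exits; the explicit run() just makes the blocking point visible.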

Is there a preferred way to run two pipelines consecutively?

Consider using a tool like Apache Airflow to orchestrate dependent pipelines. You could even use a simple bash script that deploys the second pipeline only after the first has finished.
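As an illustration, here is a minimal Airflow sketch. The DAG id, schedule, and the run_first.py / run_second.py entry points are hypothetical placeholders for your own pipeline scripts.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: each task launches one of the two pipeline scripts,
# and the second task only starts after the first succeeds.
with DAG(
    dag_id='dependent_dataflow_pipelines',
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
) as dag:
    first = BashOperator(
        task_id='first_pipeline',
        bash_command='python run_first.py --runner DataflowRunner',
    )
    second = BashOperator(
        task_id='second_pipeline',
        bash_command='python run_second.py --runner DataflowRunner',
    )

    first >> second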

Also, is there a recommended way to separate those pipelines into different files in order to test them better?

Yes, separating them into different files is reasonable, but that is simply good code organization; on its own it does not make the pipelines any easier to test. What helps testing is keeping the transforms importable so they can be exercised in isolation.
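For example, a minimal sketch of unit-testing one transform in isolation with Beam's testing utilities; SomeTransform stands in for the transform from the question, and the input and expected output below are hypothetical placeholder values.

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_some_transform():
    # Build a tiny in-memory pipeline around the transform under test.
    with TestPipeline() as p:
        output = (p
                  | beam.Create(['input line'])   # placeholder input
                  | SomeTransform())
        # Asserts on the PCollection contents when the pipeline runs.
        assert_that(output, equal_to(['expected line']))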