I want to run a pipeline on Google Dataflow that depends on the output of another pipeline. Right now I am just running the two pipelines one after the other with the DirectRunner locally:
```python
import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText(known_args.input)
     | SomeTransform()
     | beam.io.WriteToText('temp'))

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText('temp*')
     | AnotherTransform()
     | beam.io.WriteToText(known_args.output))
```
My questions are the following:
- Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?
- Is there a preferred method for running two pipelines consecutively, one after the other?
- Is there a recommended way to separate those pipelines into different files so that they can be tested more easily?