3 votes

I've written a streaming Google Dataflow pipeline in Python using the Beam SDK. There's documentation about how to run it locally and set the --runner flag to run it on Dataflow.

I'm now trying to automate the deployment of this pipeline from CI (Bitbucket Pipelines, but that's not really relevant). There is documentation on how to 'run' a pipeline, but not really on how to 'deploy' one. The command I've tested with looks like:

python -m dataflow --runner "DataflowRunner" \
                   --jobName "<jobName>" \
                   --topic "<pub-sub-topic>" \
                   --project "<project>" \
                   --dataset "<dataset>" \
                   --worker_machine_type "n1-standard-2" \
                   --temp_location "gs://<bucket-name>/tmp/"

This will run the job, but because it's streaming, the command never returns. It also internally manages the packaging and pushing to a bucket. I know the job keeps running on Dataflow if I kill that local process, but setting this up on a CI server in a way where I can detect whether the deployment actually succeeded, rather than my having just killed the process after some timeout, is difficult.

This seems ridiculous, and like I'm missing something obvious: how do I package and run this module on Dataflow from a CI pipeline in a way that lets me reliably know it deployed?

2  Shouldn't it be deployed as a template? – allan.simon

2 Answers

2 votes

So yes, it was something dumb.

Basically, when you use the

with beam.Pipeline(options=options) as p:

syntax, the context manager calls wait_until_finish() under the hood when it exits. So the wait was being invoked without me realizing it, which is why the process hung around forever. Refactoring to remove the context manager fixes the problem.
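
For illustration, a minimal sketch of that refactor (build_transforms is a hypothetical stand-in for whatever applies your transforms):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # in practice, parsed from sys.argv

# No context manager, so nothing calls wait_until_finish() for us.
pipeline = beam.Pipeline(options=options)
build_transforms(pipeline)  # hypothetical: apply your PTransforms here

# run() submits the job and returns once submission completes on
# DataflowRunner, so a CI step can exit cleanly.
pipeline.run()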

0 votes

To expand on jamielennox's answer.

When running on the direct runner in your local development environment, you want to see the pipeline run indefinitely, perhaps only cancelling it manually with Ctrl-C after a while.

When deploying the pipeline to run on GCP's Dataflow, you want your script to submit the job and then exit.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions()  # parsed from the command line
runner_name = pipeline_options.get_all_options().get('runner')

if runner_name == 'DirectRunner':
    # Local run: the context manager blocks in wait_until_finish()
    # until the streaming pipeline is cancelled.
    with beam.Pipeline(options=pipeline_options) as pipeline:
        _my_setup_pipeline(config, pipeline, subscription_full_name)

elif runner_name == 'DataflowRunner':
    # Deployment: run() submits the job to Dataflow and returns,
    # so the script (and the CI step) can exit.
    pipeline = beam.Pipeline(options=pipeline_options)
    _my_setup_pipeline(config, pipeline, subscription_full_name)
    pipeline.run()

else:
    raise Exception(f'Unknown runner: {runner_name}')
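
If the CI step also needs something concrete to record, one option (a sketch, assuming the result object returned by the Dataflow runner, which exposes the submitted job's id) is to capture the return value of run():

result = pipeline.run()
# On DataflowRunner this is a DataflowPipelineResult; logging the
# job id gives the CI pipeline evidence that submission succeeded.
print(f'Submitted Dataflow job: {result.job_id()}')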