How to run a Python Google Cloud Dataflow job with a custom Docker image?

Question

I would like to run a Python Google Cloud Dataflow job with a custom Docker image.

Based on the documentation this should be possible: https://beam.apache.org/documentation/runtime/environments/#testing-customized-images

To try this functionality I have setup the basic wordcount example pipeline with the command line options from the docs in this public repo https://github.com/swartchris8/beam_wordcount_with_docker

I can run the wordcount job with the PortableRunner locally with the apachebeam/python3.6_sdk image but using Dataflow I am unable to do this.

I am following the docs as closely as I can for the PortableRunner my args are:

python -m wordcount --input wordcount.py \
--output counts \
--runner=PortableRunner \
--job_endpoint=embed \
--environment_config=apachebeam/python3.6_sdk

For Dataflow:

python -m wordcount --input wordcount.py \
--output gs://healx-pubmed-ingestion-tmp/test/wordcount/count/count \\
--runner=DataflowRunner \
--project=healx-pubmed-ingestion \
--job_name=dataflow-wordcount-docker \
--temp_location=gs://healx-pubmed-ingestion-tmp/test/wordcount/tmp \
--experiment=beam_fn_api \
--sdk_location=/Users/chris/beam/sdks/python/container/py36/build/target/apache-beam.tar.gz \
--worker_harness_container_image=apachebeam/python3.6_sdk \
--region europe-west1 \
--zone europe-west1-c

For complete details please see the linked repo.

What am I doing wrong here or is this feature unsupported for Python jobs in Dataflow?

Valentyn Valentyn · Accepted Answer · 2020-10-23T02:05:52

You should be able to use custom containers with Dataflow with an --experiment=--use_runner_v2, which will soon be enabled by default. A sample command line may look like:

pip install apache-beam[gcp]==2.24.0
python -m apache_beam.examples.wordcount \
--output gs://healx-pubmed-ingestion-tmp/test/wordcount/ \
--runner=DataflowRunner \
--project=healx-pubmed-ingestion \
--region europe-west1 \
--temp_location=gs://healx-pubmed-ingestion-tmp/test/wordcount/tmp \
--worker_harness_container_image=apache/beam_python3.6_sdk:2.24.0 \
--experiment=use_runner_v2

To customize containers, follow instructions on https://beam.apache.org/documentation/runtime/environments/#customizing-container-images.

How to run a Python Google Cloud Dataflow job with a custom Docker image?

2 Answers