
Google Cloud Dataflow is based on Apache Beam, and Beam does not officially support Java 11. But when I ran a Dataflow job on GCP and checked the VM instance the job uses as a worker, I found that the container image is "gcr.io/cloud-dataflow/v1beta3/beam-java11-batch:beam-2.23.0". So is Dataflow using Java 11 as the Java runtime when running the job? Why not Java 8? Is there a risk of bugs?

"spec": {
  "containers": [
    {
      "args": [
        "--physmem_limit_pct=70",
        "--log_file=/var/log/dataflow/boot-json.log",
        "--log_dir=/var/log/dataflow",
        "--work_dir=/var/opt/google/dataflow",
        "--tmp_dir=/var/opt/google/tmp",
        "--endpoint=https://dataflow.googleapis.com/"
      ],
      "image": "gcr.io/cloud-dataflow/v1beta3/beam-java11-batch:beam-2.23.0",

1
I think you should ask Google this question. And if you are (really) concerned and want a practical solution, look at an earlier version: anything from the 2.17.0 SDK onwards, as of right now. Though some versions are due to be deprecated in early 2021. Source: cloud.google.com/dataflow/docs/support/… – Stephen C
FWIW, there is always a risk that software has bugs. Officially supported doesn't mean "no bugs". – Stephen C

1 Answer


The "Dataflow Runner" (the part of Apache Beam that translates a Beam pipeline to Dataflow's representation and submits the job) detects what version of Java you are using to submit the job and attempts to match it. So if you are launching your pipeline with Java 11, the worker chosen will be Java 11.

You can manually choose a container by passing the --workerHarnessContainerImage flag. This is not "supported" because it makes it easy to cause a job to fail in ways that Dataflow cannot control.
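As a sketch of what passing that flag looks like when launching a pipeline from Maven (the main class `com.example.MyPipeline`, the project ID, the region, and the exact image tag are placeholders, not values from this question):

```shell
# Hypothetical launch command: pins the worker container explicitly instead of
# letting the Dataflow runner pick one that matches the submitting JVM.
# Replace com.example.MyPipeline, my-project, and the image tag with your own.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --workerHarnessContainerImage=gcr.io/cloud-dataflow/v1beta3/beam-java8-batch:beam-2.23.0"
```

If you omit the flag and submit with a Java 11 JVM, the runner selects a java11 worker image such as the one shown in the question, which is why the image name you saw does not by itself indicate a misconfiguration.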