We are using the Python SDK for Apache Beam within the Google Dataflow environment. The tool is amazing, but we are concerned about the privacy implications of these jobs, since it looks like the workers run with public IPs. Our questions are:
- Should we be concerned about public IPs being used even after we have specified the network and subnetwork?
- What exactly is the difference, performance- and security-wise, of restricting public IPs?
- How can we set up Dataflow so that all workers are created with private IPs only? According to the docs, the template below should already prevent that behavior, yet it still happens (see the sketch after the pipeline code for the option we believe is relevant).
Our job template looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, WorkerOptions, StandardOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER
### Note that we set worker_options.subnetwork to our personal subnetwork. However, once we run the job, it still appears to create the workers with public IPs.
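### For reference, the value we pass as NETWORK follows the subnetwork path formats described in the Dataflow docs. The lines below are only a sketch with placeholder names (SUBNETWORK_NAME stands in for our real subnet; PROJECT and REGION are the same variables as above):
# Abbreviated path form
NETWORK = 'regions/{}/subnetworks/{}'.format(REGION, SUBNETWORK_NAME)
# Or the full URL form
NETWORK = ('https://www.googleapis.com/compute/v1/projects/{}/regions/{}'
           '/subnetworks/{}').format(PROJECT, REGION, SUBNETWORK_NAME)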
### In the end, the pipeline runs like this:
p = beam.Pipeline(options=options)
...
...
...
run = p.run()
run.wait_until_finish()
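For completeness, the only directly related option we could find is sketched below. This is just how we understand it (assuming the Beam Python SDK's WorkerOptions exposes a use_public_ips attribute, i.e. the --no_use_public_ips flag, and that the subnetwork needs Private Google Access enabled for workers without external IPs):

# Option A: pass the flag when constructing the options
# (assumption: our SDK version supports --no_use_public_ips)
options = PipelineOptions(flags=[
    '--requirements_file', './requirements.txt',
    '--no_use_public_ips',
])

# Option B: set the attribute on the WorkerOptions view
# (assumption: the attribute is named use_public_ips)
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False

Is this the correct way to force all Dataflow workers onto private IPs, or are we missing something?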
Thanks!