We are using the Python SDK for Apache Beam within the Google Dataflow environment. The tool is amazing, but we are concerned about the privacy implications of these jobs, since it looks like the workers run with public IPs. Our questions are:
- Should we be concerned about public IPs being used even after we have specified the network and subnetwork?
- What exactly is the difference, performance- and security-wise, of restricting public IPs?
- How can we set up Dataflow so that all workers are created with private IPs only? According to the docs, the template below should already prevent that behavior, yet it still happens (see the sketch after the pipeline code for the option we believe is relevant).
Our job template looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, WorkerOptions, StandardOptions)

options = PipelineOptions(flags=['--requirements_file', './requirements.txt'])

# Google Cloud options
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.staging_location = 'gs://{}/staging'.format(BUCKET)
google_cloud_options.temp_location = 'gs://{}/temp'.format(BUCKET)
google_cloud_options.region = REGION

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 25

options.view_as(StandardOptions).runner = RUNNER
### Note that we set worker_options.subnetwork to our personal subnetwork. However, once we run the job, it still appears to create the workers with public IPs.
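### For reference, the value we pass as NETWORK follows the subnetwork path formats described in the Dataflow docs. The lines below are only a sketch with placeholder names (SUBNETWORK_NAME stands in for our real subnet; PROJECT and REGION are the same variables as above):
# Abbreviated path form
NETWORK = 'regions/{}/subnetworks/{}'.format(REGION, SUBNETWORK_NAME)
# Or the full URL form
NETWORK = ('https://www.googleapis.com/compute/v1/projects/{}/regions/{}'
           '/subnetworks/{}').format(PROJECT, REGION, SUBNETWORK_NAME)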
### In the end, the pipeline runs like this:
p = beam.Pipeline(options=options)
...
...
...
run = p.run()
run.wait_until_finish()
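For completeness, the only directly related option we could find is sketched below. This is just how we understand it (assuming the Beam Python SDK's WorkerOptions exposes a use_public_ips attribute, i.e. the --no_use_public_ips flag, and that the subnetwork needs Private Google Access enabled for workers without external IPs):

# Option A: pass the flag when constructing the options
# (assumption: our SDK version supports --no_use_public_ips)
options = PipelineOptions(flags=[
    '--requirements_file', './requirements.txt',
    '--no_use_public_ips',
])

# Option B: set the attribute on the WorkerOptions view
# (assumption: the attribute is named use_public_ips)
worker_options = options.view_as(WorkerOptions)
worker_options.use_public_ips = False

Is this the correct way to force all Dataflow workers onto private IPs, or are we missing something?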
Thanks!