1 vote

I am using Dataflow for a MySQL to BigQuery data pipeline, based on the JDBC (MySQL) to BigQuery Dataflow template.

While creating a job from the Dataflow GUI, I can explicitly set the maximum number of workers and the total number of workers.

The problem is that if I request two workers of n1-standard-4 size, two workers are created for some time and then one of them is automatically deleted. Why don't both workers keep running for the complete operation?

There is also no difference in elapsed time whether I use 1 or 2 workers. As I understand it, the time should be cut in half if I use 2 workers instead of one. The number of files created in the GCS bucket's temp folder is also the same.

How does Dataflow manage its workers? How does it perform parallel processing? And how should I decide the number and type of workers needed for my job?


2 Answers

0 votes

The Beam framework implements something similar to MapReduce. You can parallelize the Map step (ParDo, i.e. "Parallel Do"), but you can't always parallelize the Reduce step (GroupBy); at least, not every GroupBy can be parallelized.

So, depending on your pipeline, Beam is able to dispatch the elements to be processed efficiently across the workers in parallel, and then wait before performing the GroupBy. Scaling works great for a complex pipeline, especially if you have several inputs and/or several outputs.
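
As a minimal Beam (Java) sketch of this distinction, assuming a hypothetical pipeline over a few in-memory key/value pairs: the ParDo below is element-wise and can run on many workers at once, while the GroupByKey forces a shuffle that brings all values for a key together before the pipeline can continue.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class ParDoVsGroupBy {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)))
        // Map step: element-wise, so the runner can spread it across workers.
        .apply("DoubleValues", ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            KV<String, Integer> kv = c.element();
            c.output(KV.of(kv.getKey(), kv.getValue() * 2));
          }
        }))
        // Reduce step: requires a shuffle; all values for a key must meet on one worker.
        .apply(GroupByKey.create());

    p.run().waitUntilFinish();
  }
}
```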

In your case, the pipeline is very simple: there is no transformation that you could run in parallel, simply a Read and a Write. What is there to parallelize? You don't need several workers for this!
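
The template's internal code may differ, but the shape of such a pipeline in Beam Java is roughly the sketch below (the connection string, query, and destination table are placeholders): a JDBC read feeding straight into a BigQuery write, with no intermediate transform to fan out across workers.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class JdbcToBigQuerySketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromMySQL", JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.mysql.cj.jdbc.Driver",
                    "jdbc:mysql://HOST:3306/DATABASE")   // placeholder connection string
                .withUsername("USER")
                .withPassword("PASSWORD"))
            .withQuery("SELECT id, name FROM my_table")  // hypothetical query
            .withRowMapper((JdbcIO.RowMapper<TableRow>) rs ->
                new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name")))
            .withCoder(TableRowJsonCoder.of()))
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")        // hypothetical destination table
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```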

One last point: the sink that you use, here BigQuery, can behave differently depending on your pipeline's running mode:

  • If you run your pipeline in batch mode (your case), BigQueryIO simply takes the data, creates files in a Cloud Storage staging bucket and then, at the end, triggers a single load job for all the files into the target table.
  • If you run your pipeline in streaming mode, BigQueryIO performs streaming writes into BigQuery.

This mode can influence how much of the work can be parallelized, and therefore the useful number of workers.
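
If you want to be explicit about this behavior rather than rely on the default, BigQueryIO lets you choose the write method yourself. A short sketch, assuming the same TableRow write as above (the destination table is a placeholder):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class BigQueryWriteMethods {
  // Batch-style write: stage files in GCS, then run one load job at the end
  // (the default for bounded/batch pipelines).
  static BigQueryIO.Write<TableRow> batchWrite() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")            // hypothetical destination table
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS);
  }

  // Streaming-style write: stream rows directly into BigQuery
  // (the default for unbounded/streaming pipelines).
  static BigQueryIO.Write<TableRow> streamingWrite() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS);
  }
}
```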

0 votes

There are a couple of plausible reasons why your Dataflow job does not keep both workers until the end:

- 1st: Either the whole job or some of its steps is not parallelizable. Dataflow removes the second worker so that you don't incur additional costs while that worker sits idle.

- 2nd: The workers are using, on average, less than 75% of their CPUs over a two-minute window and, for streaming pipelines, the backlog is lower than 10 seconds (1).

Please bear in mind that scaling does not happen instantaneously; Dataflow is, in this sense, conservative. It can end up spending more time adding workers than actually using them, which is why, when you expect a heavy workload with sharp peaks, it is advisable to set a high starting number of workers.
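
If you build and launch the pipeline from code rather than from the template GUI, the Dataflow runner exposes these knobs as pipeline options. A minimal sketch, assuming the Beam Java SDK with the Dataflow runner (the project, region and worker counts are placeholders):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerOptionsExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    options.setProject("my-project");              // placeholder project id
    options.setRegion("us-central1");              // placeholder region
    options.setWorkerMachineType("n1-standard-4");
    options.setNumWorkers(2);                      // starting number of workers
    options.setMaxNumWorkers(2);                   // upper bound for autoscaling
    // Pin the worker count instead of letting Dataflow scale idle workers down:
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE);

    // Pipeline.create(options) ... then build and run the pipeline as usual.
  }
}
```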

On the other hand, if only one of the two workers is actually being used, the total time will be the same whether you set the number of workers to 1 or 2. To better understand this concept, let me give an example:

Imagine you have an algorithm that produces a sequence of pseudo-random numbers where each value depends on the previous one. For this task it does not matter whether you have 1 or 100 workers; it will always run at the same speed. For other use cases, however, for example when each number does not depend on the previous one, the task would be roughly 100 times faster with 100 workers.
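
A tiny illustration of that difference in plain Java (the constants and sequence length are arbitrary): the first loop cannot be split across workers because each value needs the previous one, while the second computes every value independently from its index and could be sharded freely.

```java
import java.util.stream.LongStream;

public class ParallelismExample {
  public static void main(String[] args) {
    // Inherently sequential: each value depends on the previous one,
    // so 1 worker or 100 workers take the same time.
    long[] sequential = new long[1000];
    sequential[0] = 42L;
    for (int i = 1; i < sequential.length; i++) {
      sequential[i] = sequential[i - 1] * 6364136223846793005L + 1442695040888963407L;
    }

    // Independent per element: each value is a function of its index only,
    // so the work can be split across as many workers as are available.
    long[] independent = LongStream.range(0, 1000)
        .parallel()
        .map(i -> i * 6364136223846793005L + 1442695040888963407L)
        .toArray();

    System.out.println(sequential[999] + " " + independent[999]);
  }
}
```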

All in all, Dataflow considers how parallelizable each task is and, following the rules stated in (1), scales up and down. A higher number of workers may or may not be faster, but it will be more expensive.

Please take a look at (2) for better insight into parallelization and distribution in Dataflow. I've also found two Stack Overflow questions, (3) and (4), that might help shed some light on your question.