1 vote

I am using Dataflow for a MySQL to BigQuery data pipeline, based on the JDBC (MySQL) to BigQuery Dataflow template.

While creating a job from the Dataflow GUI, I can explicitly set the maximum number of workers and the total number of workers.

The problem is that if I request two workers of n1-standard-4 size, two workers are created for some time and then one of them is automatically deleted. Why don't both workers keep running for the complete operation?

There is also no difference in elapsed time whether I use 1 or 2 workers. As I understand it, the time should be cut in half if I use 2 workers instead of one. The number of files created in the GCS bucket's temp folder is also the same.

How does Dataflow manage its workers? How does it perform parallel processing? And how should I decide the number and type of workers needed for my job?


2 Answers

0 votes

The Beam framework implements something similar to MapReduce. You can parallelize the Map step (ParDo, i.e. "Parallel Do"), but you can't always parallelize the Reduce step (GroupBy); at least, not every GroupBy can be parallelized.

So, depending on your pipeline, Beam is able to dispatch the elements to be processed efficiently across the workers in parallel, and then wait before performing the GroupBy. Scaling works great for a complex pipeline, especially if you have several inputs and/or several outputs.
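
As a minimal Beam (Java) sketch of this distinction, assuming a hypothetical pipeline over a few in-memory key/value pairs: the ParDo below is element-wise and can run on many workers at once, while the GroupByKey forces a shuffle that brings all values for a key together before the pipeline can continue.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class ParDoVsGroupBy {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)))
        // Map step: element-wise, so the runner can spread it across workers.
        .apply("DoubleValues", ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            KV<String, Integer> kv = c.element();
            c.output(KV.of(kv.getKey(), kv.getValue() * 2));
          }
        }))
        // Reduce step: requires a shuffle; all values for a key must meet on one worker.
        .apply(GroupByKey.create());

    p.run().waitUntilFinish();
  }
}
```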

In your case, the pipeline is very simple: there is no transformation that you could run in parallel, simply a Read and a Write. What is there to parallelize? You don't need several workers for this!
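
The template's internal code may differ, but the shape of such a pipeline in Beam Java is roughly the sketch below (the connection string, query, and destination table are placeholders): a JDBC read feeding straight into a BigQuery write, with no intermediate transform to fan out across workers.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class JdbcToBigQuerySketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromMySQL", JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.mysql.cj.jdbc.Driver",
                    "jdbc:mysql://HOST:3306/DATABASE")   // placeholder connection string
                .withUsername("USER")
                .withPassword("PASSWORD"))
            .withQuery("SELECT id, name FROM my_table")  // hypothetical query
            .withRowMapper((JdbcIO.RowMapper<TableRow>) rs ->
                new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name")))
            .withCoder(TableRowJsonCoder.of()))
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")        // hypothetical destination table
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```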

One last point: the sink that you use, here BigQuery, can behave differently depending on your pipeline's running mode:

  • If you run your pipeline in batch mode (your case), BigQueryIO simply takes the data, creates files in a Cloud Storage staging bucket and then, at the end, triggers a single load job for all the files into the target table.
  • If you run your pipeline in streaming mode, BigQueryIO performs streaming writes into BigQuery.

This mode can influence how much of the work can be parallelized, and therefore the useful number of workers.
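
If you want to be explicit about this behavior rather than rely on the default, BigQueryIO lets you choose the write method yourself. A short sketch, assuming the same TableRow write as above (the destination table is a placeholder):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class BigQueryWriteMethods {
  // Batch-style write: stage files in GCS, then run one load job at the end
  // (the default for bounded/batch pipelines).
  static BigQueryIO.Write<TableRow> batchWrite() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")            // hypothetical destination table
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS);
  }

  // Streaming-style write: stream rows directly into BigQuery
  // (the default for unbounded/streaming pipelines).
  static BigQueryIO.Write<TableRow> streamingWrite() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS);
  }
}
```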

0 votes

There are a couple of plausible reasons why your Dataflow job does not keep both workers until the end:

- 1st: Either the whole job or some of its steps is not parallelizable. Dataflow removes the second worker so that you don't incur additional costs while that worker sits idle.

- 2nd: The workers are using, on average, less than 75% of their CPUs over a two-minute window and, for streaming pipelines, the backlog is lower than 10 seconds (1).

Please bear in mind that scaling does not happen instantaneously; Dataflow is, in this sense, conservative. It can end up spending more time adding workers than actually using them, which is why, when you expect a heavy workload with sharp peaks, it is advisable to set a high starting number of workers.
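
If you build and launch the pipeline from code rather than from the template GUI, the Dataflow runner exposes these knobs as pipeline options. A minimal sketch, assuming the Beam Java SDK with the Dataflow runner (the project, region and worker counts are placeholders):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerOptionsExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    options.setProject("my-project");              // placeholder project id
    options.setRegion("us-central1");              // placeholder region
    options.setWorkerMachineType("n1-standard-4");
    options.setNumWorkers(2);                      // starting number of workers
    options.setMaxNumWorkers(2);                   // upper bound for autoscaling
    // Pin the worker count instead of letting Dataflow scale idle workers down:
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE);

    // Pipeline.create(options) ... then build and run the pipeline as usual.
  }
}
```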

On the other hand, if only one of the two workers is actually being used, the total time will be the same whether you set the number of workers to 1 or 2. To better understand this concept, let me give an example:

Imagine you have an algorithm that produces a sequence of pseudo-random numbers where each value depends on the previous one. For this task it does not matter whether you have 1 or 100 workers; it will always run at the same speed. For other use cases, however, for example when each number does not depend on the previous one, the task would be roughly 100 times faster with 100 workers.
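
A tiny illustration of that difference in plain Java (the constants and sequence length are arbitrary): the first loop cannot be split across workers because each value needs the previous one, while the second computes every value independently from its index and could be sharded freely.

```java
import java.util.stream.LongStream;

public class ParallelismExample {
  public static void main(String[] args) {
    // Inherently sequential: each value depends on the previous one,
    // so 1 worker or 100 workers take the same time.
    long[] sequential = new long[1000];
    sequential[0] = 42L;
    for (int i = 1; i < sequential.length; i++) {
      sequential[i] = sequential[i - 1] * 6364136223846793005L + 1442695040888963407L;
    }

    // Independent per element: each value is a function of its index only,
    // so the work can be split across as many workers as are available.
    long[] independent = LongStream.range(0, 1000)
        .parallel()
        .map(i -> i * 6364136223846793005L + 1442695040888963407L)
        .toArray();

    System.out.println(sequential[999] + " " + independent[999]);
  }
}
```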

All in all, Dataflow considers how parallelizable each task is and, following the rules stated in (1), scales up and down. A higher number of workers may or may not be faster, but it will be more expensive.

Please take a look at (2) for better insight into parallelization and distribution in Dataflow. I've also found two Stack Overflow questions, (3) and (4), that might help shed some light on your question.