Dataflow does not distribute my processing; everything runs on a single node.
I created the following pipeline and verified that it works correctly on a small dataset:

Read from BigQuery → DoFn processing → Combine → Flatten → Combine → Flatten → Write to BigQuery
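A minimal sketch of that shape, assuming the Beam Java SDK (the table references, element types, and DoFns are placeholders, not my actual code; the real pipeline's Flattens merge several branches, collapsed to one here):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PipelineShape {

  // Placeholder DoFn: extracts a key/value pair from each input row.
  static class ExtractFn extends DoFn<TableRow, KV<String, Long>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      TableRow row = c.element();
      c.output(KV.of((String) row.get("key"), ((Number) row.get("value")).longValue()));
    }
  }

  // Placeholder DoFn: converts the combined result back into a TableRow.
  static class ToRowFn extends DoFn<KV<String, Long>, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(new TableRow().set("key", c.element().getKey())
                             .set("value", c.element().getValue()));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read from BigQuery → DoFn processing.
    PCollection<KV<String, Long>> parsed =
        p.apply("ReadFromBQ", BigQueryIO.readTableRows().from("my-project:my_dataset.input"))
         .apply("Process", ParDo.of(new ExtractFn()));

    // First Combine → Flatten (a single-branch Flatten stands in for the real merge).
    PCollection<KV<String, Long>> flattened1 =
        PCollectionList.of(parsed.apply("Combine1", Sum.longsPerKey()))
            .apply("Flatten1", Flatten.pCollections());

    // Second Combine → Flatten.
    PCollection<KV<String, Long>> flattened2 =
        PCollectionList.of(flattened1.apply("Combine2", Sum.longsPerKey()))
            .apply("Flatten2", Flatten.pCollections());

    // Write to BigQuery (assumes the output table already exists).
    flattened2.apply("ToRows", ParDo.of(new ToRowFn()))
        .apply("WriteToBQ", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.output")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```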
Next, I tested it with a large dataset to confirm that it runs in parallel across multiple nodes.
I specified --numWorkers and --autoscalingAlgorithm=NONE as optional parameters when launching the Dataflow job.
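For reference, a sketch of the equivalent programmatic setup (the worker count of 5 is a placeholder; the flag names are the documented Dataflow options):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchOptions {
  public static void main(String[] args) {
    // Same effect as passing on the command line:
    //   --runner=DataflowRunner --numWorkers=5 --autoscalingAlgorithm=NONE
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setNumWorkers(5);                                       // fixed worker count
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE); // disable autoscaling
  }
}
```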
Because execution took a very long time, I investigated:
- I checked the execution status in the Dataflow job view: the Combine steps are what take the time.
- I inspected the metrics of the GCE VM instances: a single machine is busy consuming resources while the other machines sit idle.
- I checked the logs in Stackdriver: the addInput calls of the Combine step run only on the single active machine identified by the metrics above (see the CombineFn sketch after this list).
- Elsewhere in the Stackdriver logs, I occasionally see the message "Discarding invalid work item null" on the idle machines.
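To make the addInput observation concrete, here is a plain long-sum CombineFn (a stand-in, not my actual combiner) showing where addInput sits in the lifecycle:

```java
import org.apache.beam.sdk.transforms.Combine;

// addInput is the per-element accumulation step of a CombineFn, so if the
// runner feeds all elements to one worker, every addInput call runs there.
public class SumFn extends Combine.CombineFn<Long, Long, Long> {
  @Override public Long createAccumulator() { return 0L; }

  // This is the method whose log lines appear only on the one busy machine.
  @Override public Long addInput(Long accumulator, Long input) {
    return accumulator + input;
  }

  // Partial accumulators from different workers are merged here; merging only
  // parallelizes anything if the addInput calls themselves were spread out.
  @Override public Long mergeAccumulators(Iterable<Long> accumulators) {
    long sum = 0L;
    for (Long a : accumulators) { sum += a; }
    return sum;
  }

  @Override public Long extractOutput(Long accumulator) { return accumulator; }
}
```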
Incidentally, if I do not specify --numWorkers and --autoscalingAlgorithm=NONE, that is, with autoscaling enabled, only one node is started.
I thought that writing the program with idiomatic Beam transforms would let Dataflow distribute the work across many nodes in a "good" way, but it behaves differently than I expected.

How can I get it to distribute the work?
Have you tried Reshuffle.viaRandomKey() between the first Flatten and the second Combine? That might spread the data across workers. – Udi Meiri
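A sketch of where that suggestion would slot in, reusing the flattened1 placeholder from the pipeline sketch above (Reshuffle.viaRandomKey() is part of the Beam Java SDK):

```java
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Redistribute elements across workers between the first Flatten and the
// second Combine, so the combine inputs are no longer pinned to one machine.
PCollection<KV<String, Long>> reshuffled =
    flattened1.apply("BreakFusion", Reshuffle.viaRandomKey());

PCollection<KV<String, Long>> combined2 =
    reshuffled.apply("Combine2", Sum.longsPerKey());
```

As I understand it, Dataflow fuses adjacent steps into a single stage, so a step that emits its output on one worker can pin all the fused downstream work to that same worker; Reshuffle acts as a fusion break and rebalances the elements before the next stage.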