2 votes

I have a Dataflow streaming job that processes data consumed from a Pub/Sub topic and transforms/writes the data to a Bigtable instance. The autoscaling settings are:

autoscalingAlgorithm=THROUGHPUT_BASED
maxNumWorkers=15
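
For context, this is roughly how those options are wired into the job launch; a minimal sketch assuming the Beam Java SDK (the project/topic names are placeholders, and the transform/Bigtable write step is elided):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PubsubToBigtableJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    // The autoscaling settings listed above:
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(15);

    Pipeline p = Pipeline.create(options);
    p.apply("ReadFromPubSub",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));
    // ... transform and write to Bigtable (e.g. BigtableIO.write()) happens here ...
    p.run();
  }
}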

The last job had been running for about 1 month (job ID: 2020-11-22_23_08_42-17274274260380601237). Prior to ~11-12 Dec (2020) it was running okay, i.e. autoscaling worked as expected: with higher throughput (higher CPU utilization) more workers were used, and when throughput decreased (and correspondingly CPU utilization), it scaled back down to 1 worker. Since 11-12 Dec, however, the number of Dataflow workers has been permanently stuck at the maximum (15) and does not scale back down, resulting in a high cost for our Dataflow usage.

As mentioned in the documentation (ref: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling): if a streaming pipeline backlog is lower than 10 seconds and workers are utilizing, on average, less than 75% of the CPUs for a period of a couple of minutes, Dataflow scales down; after scaling down, workers utilize, on average, 75% of their CPUs. Since 11-12 Dec this has not been happening: after stabilizing, the CPU utilization of the workers is around ~6%, which is far below the level at which scaling down should kick in, yet it doesn't.
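
For reference, this is one way to check the backlog side of that criterion; a minimal sketch with the Cloud Monitoring v3 Java client (project and subscription names are placeholders) that pulls the age of the oldest unacknowledged message on the subscription the job reads from:

import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class BacklogCheck {
  public static void main(String[] args) throws Exception {
    long now = System.currentTimeMillis();
    try (MetricServiceClient client = MetricServiceClient.create()) {
      ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
          .setName(ProjectName.of("my-project").toString())
          // Age (in seconds) of the oldest unacked message on the subscription;
          // a value of a few seconds means there is effectively no backlog.
          .setFilter("metric.type=\"pubsub.googleapis.com/subscription/oldest_unacked_message_age\""
              + " AND resource.labels.subscription_id=\"my-subscription\"")
          .setInterval(TimeInterval.newBuilder()
              .setStartTime(Timestamps.fromMillis(now - 10 * 60 * 1000))
              .setEndTime(Timestamps.fromMillis(now))
              .build())
          .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
          .build();
      for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
        series.getPointsList().forEach(point ->
            System.out.println("oldest unacked message age: "
                + point.getValue().getInt64Value() + "s"));
      }
    }
  }
}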

Looking at the traffic for the particular Pub/Sub topic, the published message rate has remained quite consistent over the past month, with no particular spikes. There are also no errors writing to Bigtable that I can observe.

I have tried redeploying the Dataflow streaming job twice, with the same effect. I'm not sure whether anyone else is facing similar issues; I'd appreciate any advice on where else I can look or troubleshoot. Thanks in advance!

Check the Pub/Sub backlog metric in Cloud Monitoring. Is it lower than 10 sec? – kallusis369
Yes, it looks like the oldest unacked message is something like 1-2 s. – jlyh
Do you use windows in your pipeline? If so, which type? Do you use features such as allowed lateness (if so, what's the value) or aggregations such as average? – guillaume blaquiere
I use a fixed window type with allowed lateness set to 7 days, and I have the same problem. Can you give me some advice? – sees
I'm not currently doing any windowing or batching; it's a purely streaming job. – jlyh

2 Answers

0 votes

I searched the Google Dataflow documentation and found a clue.

According to the documentation, Dataflow tends to balance the number of workers against the number of persistent disks.

This could be the cause of the problem.

If you don't use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker.

When your Dataflow job needs a specific number of workers in the 9-14 range, it tends to stay at 15 worker nodes, because the persistent disks are distributed equally across 15 worker nodes (1 PD per node).

Before 11-12 Dec, your Dataflow job would have needed 8 or fewer workers, so downscaling succeeded because the PDs could be distributed evenly at those worker counts.

This process is described in the Dataflow documentation:

For example, if your pipeline needs 3 or 4 workers in steady state, you could set --maxNumWorkers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which corresponds to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.
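
The worker counts in that example follow from spreading the allocated persistent disks as evenly as possible; a small sketch of my own (not Dataflow code) that reproduces the same list:

import java.util.Set;
import java.util.TreeSet;

public class AllowedWorkerCounts {
  public static void main(String[] args) {
    // With maxNumWorkers=15 and no Streaming Engine, 15 persistent disks are
    // allocated up front and must be spread across the running workers.
    int persistentDisks = 15;
    Set<Integer> allowedWorkerCounts = new TreeSet<>();
    for (int disksPerWorker = 1; disksPerWorker <= persistentDisks; disksPerWorker++) {
      // With d disks per worker, ceil(15 / d) workers are needed to host all disks.
      allowedWorkerCounts.add((persistentDisks + disksPerWorker - 1) / disksPerWorker);
    }
    System.out.println(allowedWorkerCounts); // prints [1, 2, 3, 4, 5, 8, 15]
  }
}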

As a result, your Dataflow job has maintained the number of worker nodes at 15, and it will stay there until it needs 8 or fewer worker nodes.

To resolve this problem, raise your maxNumWorkers above 15.

This will incur a slight increase in cost (more persistent disks are allocated up front), but downscaling will then proceed successfully.
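
A minimal sketch of what the relaunch could look like, assuming the Beam Java SDK (the value 20 is only an example; the --enableStreamingEngine pipeline flag is an alternative that avoids the persistent disk coupling entirely, since Streaming Engine keeps pipeline state in the Dataflow backend):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RelaunchOptions {
  public static void main(String[] args) {
    // e.g. launched with --runner=DataflowRunner, optionally with --enableStreamingEngine
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    // Give autoscaling more headroom than the observed steady-state peak, so the
    // job is not pinned at a worker count the disk distribution cannot step down from.
    options.setMaxNumWorkers(20);
    // ... then build and run the pipeline as before ...
  }
}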

0 votes

We faced this issue as well. Our streaming Dataflow jobs were drained automatically last night, and the underlying issue is not resolved. It looks like a Google issue. Please restart your Dataflow jobs.