I have a Dataflow streaming job that consumes data from a Pub/Sub topic, transforms it, and writes it to a Bigtable instance. The autoscaling settings are:
autoscalingAlgorithm=THROUGHPUT_BASED
maxNumWorkers=15
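
For reference, this is roughly how those options are wired up when the job is launched. This is only a minimal sketch assuming the Beam Java SDK with the Dataflow runner; apart from the two flags above, the class name and the commented pipeline steps are illustrative, not the actual job code.

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PubSubToBigtableJob {
  public static void main(String[] args) {
    // Parse --project, --region, etc. from the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    // Streaming job with the autoscaling settings listed above.
    options.setStreaming(true);
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(15);

    Pipeline pipeline = Pipeline.create(options);
    // ... PubsubIO read -> transform -> Bigtable write ...
    pipeline.run();
  }
}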
The most recent job has been running for about a month (Job ID: 2020-11-22_23_08_42-17274274260380601237). Prior to ~11-12 Dec 2020 it behaved as expected: with higher throughput (and higher CPU utilization) more workers were used, and when throughput (and correspondingly CPU utilization) decreased, it scaled back down to 1 worker. Since 11-12 Dec, however, the job has remained scaled up at the maximum number of workers (15) and never scales back down, which is significantly increasing our Dataflow cost.
According to the documentation (https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling): "If a streaming pipeline backlog is lower than 10 seconds and workers are utilizing on average less than 75% of the CPUs for a period of a couple minutes, Dataflow scales down. After scaling down, workers utilize, on average, 75% of their CPUs." Since 11-12 Dec this has not been happening: after the job stabilizes, worker CPU utilization sits at around ~6%, well below the documented threshold for scaling down, yet no downscaling occurs.
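
Just to spell out the documented condition, here it is written as a plain predicate. This is purely illustrative; the autoscaler evaluates this internally, and the method name and signature are hypothetical, not part of any API.

// Downscale condition per the documentation: backlog under 10 seconds and
// average worker CPU under 75%, sustained for a couple of minutes.
static boolean shouldScaleDown(double backlogSeconds, double avgCpuFraction) {
  return backlogSeconds < 10.0 && avgCpuFraction < 0.75;
}

With workers at ~6% CPU (avgCpuFraction ≈ 0.06) and essentially no backlog, this condition should clearly be met, yet the job stays at 15 workers.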
Looking at the traffic on the Pub/Sub topic, the rate of published messages has remained fairly consistent over the past month, with no particular spikes. I also don't observe any errors when writing to Bigtable.
I have tried redeploying the Dataflow streaming job twice, with the same result. Is anyone else facing a similar issue? Any advice on where else to look or how to troubleshoot would be appreciated. Thanks in advance!