1
votes

In the Dataflow Monitoring Interface for Beam pipeline executions, each transform box shows a time duration (see https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf).

For bounded data, I understand this to be the estimated time it will take for the transform to complete. However, for unbounded data, as in my streaming case, how should I interpret this number?

Some of my transforms show a significantly higher duration than the others, which I take to mean that they simply take more time. But what are the other implications? How does this uneven distribution affect my execution, especially when I have windowing functions in place? (A sketch of the kind of pipeline I mean follows below.)
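
For reference, the shape of the pipeline I have in mind is roughly the following sketch; the Pub/Sub topics, step labels, and parsing logic are placeholders rather than my actual code:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder options; in practice these would also include the Dataflow
# runner, project, region, etc.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # Unbounded source: messages arrive continuously from Pub/Sub.
     | "ReadEvents" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/events")
     # Decode and parse each message (assumed JSON payloads).
     | "ParseEvents" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     # Fixed one-minute windows; this is the "windowing function" in question.
     | "WindowPerMinute" >> beam.WindowInto(window.FixedWindows(60))
     # Key by a field and count events per key per window.
     | "KeyByUser" >> beam.Map(lambda event: (event.get("user", "unknown"), 1))
     | "CountPerUser" >> beam.CombinePerKey(sum)
     # Format and write results back out.
     | "FormatCounts" >> beam.Map(lambda kv: ("%s,%d" % kv).encode("utf-8"))
     | "WriteCounts" >> beam.io.WriteToPubSub(
           topic="projects/my-project/topics/counts"))
```

Each labeled step corresponds to a transform box in the monitoring UI, and some of those boxes (e.g. the parse step) show a much higher duration than the rest.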

Also, is this related to autoscaling? For example, are more workers spun up if the execution time exceeds certain thresholds, or does autoscaling depend on the data volume at the input?


1 Answer

2
votes

In both Batch and Streaming, this is a measure of how long those steps have spent active on each work thread. The number of threads per worker machine differs between Batch and Streaming, and, as you note, more workers mean more worker threads.

There aren't any direct implications: these measurements are provided as a way of understanding what the work threads have spent most of their time doing. If the pipeline as a whole is behaving reasonably, you don't need to do anything. If you think the pipeline is slower than you expect, or if one of the steps seems to be taking longer than you would expect, these numbers are a starting point for understanding performance.

In some sense these are similar to how a profile of time spent in various functions can be useful for improving the performance of a normal program. One function taking longer than another has no impact by itself, but it may be useful information to have.
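
To make that analogy concrete, here is a minimal standalone Python sketch (not Beam-specific) of the kind of per-function profile the comparison refers to; the function names and workloads are made up:

```python
import cProfile
import pstats


def parse(records):
    # Deliberately heavier step, analogous to a slow transform.
    return [r.strip().split(",") for r in records for _ in range(50)]


def format_output(rows):
    # Lighter step, analogous to a cheap transform.
    return [";".join(r) for r in rows]


def run():
    records = ["a,b,c\n"] * 10_000
    format_output(parse(records))


# Profile the run and print time spent per function, sorted by cumulative time.
profiler = cProfile.Profile()
profiler.runcall(run)
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

The profile will show that parse dominates the wall time without that being a problem in itself, which is the same way the per-step durations in the Dataflow UI are meant to be read.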