0
votes

I have a streaming Dataflow pipeline reading from a Pub/Sub subscription with no windowing applied. The first step of the pipeline is to read from the Pub/Sub subscription. How does Dataflow decide how many messages to accumulate in that first step before emitting them to the next step, while continuing to read more incoming messages from Pub/Sub?

2 Answers

2
votes

In the absence of any grouping / combine transforms, it's just done based on bundles:

'... processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure. For example, a streaming runner may prefer to process and commit small bundles, and a batch runner may prefer to process larger bundles.'

You can read more about the details here.
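If you want to see where the runner is drawing bundle boundaries, a minimal sketch (Python SDK) is to count elements between `start_bundle` and `finish_bundle` in a `DoFn` right after the Pub/Sub read. The subscription path here is a placeholder; the bundle sizes you observe are chosen by the runner, not something you configure.

```python
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class LogBundleSize(beam.DoFn):
    """Counts elements per bundle; bundle boundaries are picked by the runner."""

    def start_bundle(self):
        self._count = 0

    def process(self, element):
        self._count += 1
        yield element

    def finish_bundle(self):
        logging.info("bundle finished with %d elements", self._count)


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "LogBundles" >> beam.ParDo(LogBundleSize()))
```

On a streaming runner you should typically see small bundles, matching the quoted guidance above.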

0
votes

If you don't define a window and trigger policy yourself, a default (global) window is defined with a default trigger (which discards late messages). You can find this in the documentation:

Caution: Beam’s default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:

  • Set a non-global windowing function. See Setting your PCollection’s windowing function.
  • Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.

If you don’t set a non-global windowing function or a non-default trigger for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your job will fail.
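As a minimal sketch of the first option (setting a non-global windowing function before a grouping transform), assuming the Pub/Sub messages are UTF-8 "key,value" strings; the subscription path and the parsing are placeholders:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def parse(msg):
    # Placeholder parsing: split a "key,value" payload into a (key, value) pair.
    key, value = msg.decode("utf-8").split(",", 1)
    return key, value


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "Parse" >> beam.Map(parse)
     # Non-global window (60-second fixed windows); without this or a
     # non-default trigger, the GroupByKey below would fail at construction.
     | "Window" >> beam.WindowInto(window.FixedWindows(60))
     | "Group" >> beam.GroupByKey())
```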