0
votes

I have a streaming Dataflow pipeline reading from a Pub/Sub subscription with no windowing applied. The first step of the pipeline is to read from the Pub/Sub subscription. How does Dataflow decide how many messages to accumulate in that first step before emitting them to the next step, while continuing to read more incoming messages from Pub/Sub?

2 Answers

2
votes

In the absence of any grouping / combine transforms, it's just done based on bundles:

'... processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure. For example, a streaming runner may prefer to process and commit small bundles, and a batch runner may prefer to process larger bundles.'

You can read more about the details here.
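If you want to see where the runner is drawing bundle boundaries, a minimal sketch (Python SDK) is to count elements between `start_bundle` and `finish_bundle` in a `DoFn` right after the Pub/Sub read. The subscription path here is a placeholder; the bundle sizes you observe are chosen by the runner, not something you configure.

```python
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class LogBundleSize(beam.DoFn):
    """Counts elements per bundle; bundle boundaries are picked by the runner."""

    def start_bundle(self):
        self._count = 0

    def process(self, element):
        self._count += 1
        yield element

    def finish_bundle(self):
        logging.info("bundle finished with %d elements", self._count)


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "LogBundles" >> beam.ParDo(LogBundleSize()))
```

On a streaming runner you should typically see small bundles, matching the quoted guidance above.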

0
votes

If you don't define a window and trigger policy yourself, a default (global) window is defined with a default trigger (which discards late messages). You can find this in the documentation:

Caution: Beam’s default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:

  • Set a non-global windowing function. See Setting your PCollection’s windowing function.
  • Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur.

If you don’t set a non-global windowing function or a non-default trigger for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your job will fail.
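As a minimal sketch of the first option (setting a non-global windowing function before a grouping transform), assuming the Pub/Sub messages are UTF-8 "key,value" strings; the subscription path and the parsing are placeholders:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def parse(msg):
    # Placeholder parsing: split a "key,value" payload into a (key, value) pair.
    key, value = msg.decode("utf-8").split(",", 1)
    return key, value


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | "Parse" >> beam.Map(parse)
     # Non-global window (60-second fixed windows); without this or a
     # non-default trigger, the GroupByKey below would fail at construction.
     | "Window" >> beam.WindowInto(window.FixedWindows(60))
     | "Group" >> beam.GroupByKey())
```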