1
votes

I am trying to implement a pipeline and takes in a stream of data and every minutes output a True if there is any element in the minute interval or False if there is none. The pane (with forever time trigger) or window (fixed window) does not seem to trigger if there is no element for the duration.

One workaround I am thinking is to put the stream into a global window, use a ValueState to keep a queue to accumulate the data and a timer as a trigger to exam the queue. I wonder if there is any neater way of achieving this.

Thanks.

3

3 Answers

2
votes

I think your timers and state solution is a good way to do this. However, keep in mind that your timers will not be set until you receive at least one element for a key.

If this is an issue, then the other thing you could do is inject a PCollection so that every window is guaranteed to have at least one dummy element. Then you can use ValueState to check if any element besides the dummy element has arrived. Or alternatively use Count.PerElement over the window and check if there is more than 1 element(An additional element, that is not the dummy element) for that window.

0
votes

I believe you can achieve this behaviour by setting

.withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)

in your windowing step.

0
votes

I think that the Beam guys call this pattern "Looping timers" (https://beam.apache.org/blog/looping-timers/, https://www.youtube.com/watch?v=Q_v5Zsjuuzg). There are several solutions to this problem and there are some trade-offs.

Be sure to read the fine prints! For example at the date of writing this (Jan '21) Google Cloud Dataflow Runners Drain feature does not support looping timers but it might change in the future: https://beam.apache.org/documentation/runners/capability-matrix/