Is there any way to trigger early output of windows when running in batch mode? I've tried a number of triggers with the Dataflow runner to get early window output, but the results are always held until the end of processing.
2 Answers
Unlike streaming, Dataflow batch always executes entire (fused) stages to completion in topological order (including GroupByKey). As such, once it starts processing a key after the GBK, it already has all the values for that key and calls the downstream operations exactly once with the key and its values. Triggers in Beam are a lower bound on how soon data for a window can be released, but do not force an early release (hence the names AfterCount, AfterWatermark, etc.), and as such the batch model technically satisfies the contract with one and only one "firing."
It is not possible to get early window output on Dataflow (or any other runner that I know of) in batch mode.
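To make the difference concrete, here is a minimal toy model in plain Python. It does not use the Beam API, and the function names `batch_group_by_key` and `streaming_after_count` are hypothetical, made up for this illustration. It only sketches the behavior described above: a batch-style GroupByKey consumes all input before emitting each key once, while a streaming-style runner that honors a repeated count-based trigger could emit several panes per key.

```python
from collections import defaultdict


def batch_group_by_key(pairs):
    """Toy batch GroupByKey: read ALL input first, then emit each key once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Downstream sees nothing until the whole input has been consumed,
    # so every key produces exactly one "firing".
    return [(key, values) for key, values in grouped.items()]


def streaming_after_count(pairs, count):
    """Toy streaming model: emit a pane whenever a key has buffered `count` values."""
    buffers = defaultdict(list)
    panes = []
    for key, value in pairs:
        buffers[key].append(value)
        if len(buffers[key]) >= count:
            # Early firing: release the buffered values and start a new pane.
            panes.append((key, buffers.pop(key)))
    # Flush whatever is left as the final pane for each key.
    for key, values in buffers.items():
        panes.append((key, values))
    return panes


events = [("a", 1), ("a", 2), ("b", 10), ("a", 3), ("a", 4), ("b", 20)]

batch_panes = batch_group_by_key(events)
streaming_panes = streaming_after_count(events, count=2)

print(batch_panes)      # one pane per key
print(streaming_panes)  # key "a" is split across two panes
```

In the batch model, key "a" arrives downstream once with all four values, which is why a trigger configured on the window has no visible effect there.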
It depends entirely on the type of operation you are performing. If you are performing an aggregate operation, Dataflow will hold the result until the end of that processing step. Otherwise, Dataflow will release the output as soon as the processing part completes; it will not wait for the complete processing stage to finish. If possible, post your code so that I can help debug it.