0
votes

When using sliding time window in Apache Flink, a number of tuples/elements in the window are recomputed as the window slides. For instance, assuming a window of size 5 seconds with slide of 1 second, 80% of the window contents are same as that of the last window.

window(SlidingEventTimeWindows.of(Time.seconds(5), Time.seconds(1)))

Consider a data stream S, whose tuples consist of a timestamp and an integer value: <t1,12>, <t2,3>, <t3,15>, <t4,7>, <t5,9>, <t6,18>, <t7,2>, ...

Assuming t1, t2, t3, ... denote consecutive timestamps, where t2-t1 = 1 second. Given S, the Flink windowed ProcessWindowFunction with window of size 5 seconds and slide 1 second gets tuples as follows:

Window1: <t1,12>, <t2,3>, <t3,15>, <t4,7>, <t5,9>
Window2: <t2,3>, <t3,15>, <t4,7>, <t5,9>, <t6,18>
Window3: <t3,15>, <t4,7>, <t5,9>, <t6,18>, <t7,2>
...

Although I could use state variables to store the result of previous overlapping window computation, I could not find a way to filter out the overlapping tuples in the next window.

One solution I could think is to utilize the last window ending timestamp to ignore the computation in the current ProcessWindowFunction, but doing so saves only a little computation as the tuples are already in the ProcessWindowFunction. Is there any way to filter out the overlapping tuples before reaching the ProcessWindowFunction?

1

1 Answers

0
votes

I don't get what is the problem at hand: performance? or just having non overlapping tuples? So I'll answer both:


Having non overlapping tuples

Seems like you need:

window(TumblingEventTimeWindows.of(Time.seconds(1)))

Performance

Indeed windows slices overlap, and some computation / state could be preserved. Some researchers have begun to tackle the problem with "Scotty: Efficient Window Aggregation for out-of-order Stream Processing".

I believe it does work in Flink, but in as a separate library. We're all waiting that a charitable soul merges their work in Flink.