3
votes

We have a development in kafka-streams which produces a time window aggregation kind of:

selectKey().groupByKey().aggregate()

and then using

TimeWindows.of().until()

My main question is what happens if until is not used, let's imagine that we have 1 minute windows and for some unforeseen reason a new event arrives from 1 week ago, does the application save all the windows state from the beginning?, wouldn't it produce an excessive consumption of memory or, in case of persisting the state of all the windows, wouldn't it suppose a significant delay to recover the appropriate window?

1

1 Answers

2
votes

We faced the same question recently. The answer can be found in this source file for kafka streams:

https://github.com/apache/kafka/blob/1.1/streams/src/main/java/org/apache/kafka/streams/kstream/Windows.java

Which contains:

static final long DEFAULT_MAINTAIN_DURATION_MS = 24 * 60 * 60 * 1000L; // one day

So without specifying an until() setting, your windowed state store will keep records for (a lower bound of) one day, by default.

The other part of your question is: what happens to events that arrive late, past the time when the window has been expired? That answer is in the developer guide:

In the Kafka Streams DSL users can specify a retention period for the window. This allows Kafka Streams to retain old window buckets for a period of time in order to wait for the late arrival of records whose timestamps fall within the window interval. If a record arrives after the retention period has passed, the record cannot be processed and is dropped.

Piecing that information together shows that if you don't specify an until() setting on your windowed stream, the records will be kept for at least one day, and records arriving more than one day late will be dropped.