3
votes

The documentation at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking shows an example with a 10-minute window, a 10-minute watermark, and a 5-minute trigger.

In the diagram for APPEND mode, the first results from the 12:00:00->12:10:00 window are only shown at 12:25:00. The reason is that at that time the watermark is at 12:11:00, so all windows ending before 12:11:00 can be sent to the sink.

However, at 12:20:00 we already know the watermark is 12:11:00. So why isn't the first window sent at 12:20:00 instead of 12:25:00?


1 Answer

1
votes

Because Spark applies a single global watermark rather than one watermark per partition: the watermark for the next batch is decided when the tasks in the current batch finish. An individual partition cannot decide the watermark, because it only knows about the events in its own partition.

So at 12:20:00, Spark receives the 12:21:00 event and processes it. At the end of that batch, Spark collects the event timestamps, determines the maximum, and decides the watermark for the next batch, 12:11:00, which is the watermark used by the batch at 12:25:00.
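
To make the timing concrete, here is a rough pure-Python sketch of that behaviour. It is a simplified model, not Spark's actual implementation: the `simulate` function, the tumbling 10-minute windows, and the event times in `BATCHES` are all assumptions made up for illustration (times are minutes after 12:00). The only point it shows is that each batch runs with the watermark computed at the end of the previous batch.

```python
WATERMARK_DELAY = 10  # minutes, like withWatermark(..., "10 minutes")
WINDOW = 10           # minutes; tumbling windows to keep the sketch short

# Hypothetical input: (trigger time, event times seen in that batch),
# all expressed as minutes after 12:00.
BATCHES = [
    (10, [7, 8]),
    (15, [14, 9]),
    (20, [21, 15, 8]),  # the 12:21 event is processed by the 12:20 batch
    (25, [17]),
]

def simulate(batches):
    watermark = 0        # watermark in effect for the batch being processed
    max_event_time = 0   # global max event time seen so far
    open_windows = {}    # window start -> event count
    emitted = {}         # trigger time -> list of (window start, count)

    for trigger, events in batches:
        # 1. The batch runs with the watermark fixed after the previous batch.
        for t in events:
            start = (t // WINDOW) * WINDOW
            if start + WINDOW <= watermark:
                continue  # too late: that window is already finalized
            open_windows[start] = open_windows.get(start, 0) + 1

        # Append mode: emit only windows that the current watermark has passed.
        closed = sorted(s for s in open_windows if s + WINDOW <= watermark)
        emitted[trigger] = [(s, open_windows.pop(s)) for s in closed]

        # 2. Only after the batch finishes is the global max event time
        #    collected and the watermark for the NEXT batch decided.
        if events:
            max_event_time = max(max_event_time, max(events))
        watermark = max(watermark, max_event_time - WATERMARK_DELAY)

    return emitted

result = simulate(BATCHES)
for trigger, windows in result.items():
    print(f"trigger 12:{trigger:02d} -> emitted {windows}")
```

In this model the 12:20 batch still runs with the old watermark (12:04), so it emits nothing. The 12:11 watermark only exists once that batch has finished, and the 12:00->12:10 window is emitted by the 12:25 batch.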