How spark structured streaming calculate watermark

Question

However, to run this query for days, it’s necessary for the system to bound the amount of intermediate in-memory state it accumulates. This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time. For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees). Let’s understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.

Above is copied from http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.

From the above documenation description(For a specific window starting at time T)，it is the starting time of a given window.

I think the document is wrong, it should be the ending time of a given window.

Tom Tom · Accepted Answer · 2018-08-08T00:15:45

I confirm by investigating the spark code， the document is wrong, T is the ending time of the window

How spark structured streaming calculate watermark

1 Answers