
I am reading https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamp_extractors.html#fixed-amount-of-lateness, and it looks like it is saying that if t_eventtime < t_watermark (strictly less than), then the event is determined to be late.

How about when the event time is equal to the watermark? If t_eventtime = t_watermark, is the event then not late?

I had always thought that if t_eventtime <= t_watermark, then the event is determined to be late.

Could you please show me the code where this determination happens? Thanks.

Could someone take a look? Thanks! Tom

1 Answer


Indeed, the code that determines whether an event is late in that case uses a <= comparison, so an event is considered late if its timestamp plus the allowed lateness is before or equal to the watermark, i.e. if its lateness is >= 0:

    protected boolean isElementLate(StreamRecord<IN> element) {
        return (windowAssigner.isEventTime())
                && (element.getTimestamp() + allowedLateness
                        <= internalTimerService.currentWatermark());
    }
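To make the boundary concrete, here is a small standalone sketch (not Flink code; the class and method names are illustrative) that mirrors the comparison in `isElementLate` above:

```java
// Illustrative sketch of the WindowOperator comparison: an element is late
// iff timestamp + allowedLateness <= currentWatermark.
public class LatenessCheck {

    static boolean isElementLate(long elementTimestamp, long allowedLateness, long currentWatermark) {
        // Same <= comparison as the Flink snippet above (event-time check omitted).
        return elementTimestamp + allowedLateness <= currentWatermark;
    }

    public static void main(String[] args) {
        long watermark = 1_000L;
        long allowed = 0L;
        System.out.println(isElementLate(999L, allowed, watermark));   // strictly before: late
        System.out.println(isElementLate(1_000L, allowed, watermark)); // equal: also late
        System.out.println(isElementLate(1_001L, allowed, watermark)); // after: not late
    }
}
```

So yes: with this comparison alone, an element whose timestamp exactly equals the watermark counts as late.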

Now, for completeness' sake, note that the value of the watermark itself, for the strategy you are referring to, is computed as the latest timestamp ever seen, minus the out-of-orderness, minus 1:

    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(maxTimestamp - outOfOrdernessMillis - 1));
    }
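A quick sketch of that arithmetic (illustrative names and numbers, not Flink code):

```java
// Illustrative: the watermark value emitted by the bounded-out-of-orderness
// strategy, per the onPeriodicEmit snippet above.
public class WatermarkValue {

    static long watermarkFor(long maxTimestampSeen, long outOfOrdernessMillis) {
        // Latest timestamp seen, minus the out-of-orderness, minus 1.
        return maxTimestampSeen - outOfOrdernessMillis - 1;
    }

    public static void main(String[] args) {
        // E.g. with 5 s of out-of-orderness and a max timestamp of 10 000 ms:
        System.out.println(watermarkFor(10_000L, 5_000L)); // 4999, not 5000
    }
}
```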

That means the actual lateness used for the comparison in the first snippet above is 1 millisecond more than what intuition might suggest, so that lateness > 0 is really the condition we humans should read to understand what is going on.
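Putting the two snippets together (with illustrative constants): an event whose lateness relative to maxTimestamp - outOfOrderness is exactly 0 is not considered late, because the emitted watermark sits 1 ms lower.

```java
// Illustrative: combining the two snippets above to show the 1 ms margin
// at the exact boundary.
public class BoundaryDemo {

    public static void main(String[] args) {
        long maxTimestampSeen = 10_000L;
        long outOfOrderness = 5_000L;
        long allowedLateness = 0L;

        // Watermark as emitted by onPeriodicEmit: 10000 - 5000 - 1 = 4999.
        long watermark = maxTimestampSeen - outOfOrderness - 1;

        // An event exactly at maxTimestampSeen - outOfOrderness:
        long eventTs = 5_000L;

        // isElementLate comparison: 5000 + 0 <= 4999 is false,
        // so the boundary event is NOT late, thanks to the extra -1.
        boolean late = eventTs + allowedLateness <= watermark;
        System.out.println(late); // false
    }
}
```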

Now, those parameters are meant as estimates of how out of order we believe our data might be in the real world, perhaps due to network race conditions or misaligned clocks, and we are typically far less precise than 1 ms when making that estimate. So in practice it should not matter much: hopefully such boundary occurrences are rare in our data, though how many there are is somewhat random, depending on the data itself.