
In Apache Flink, setAutoWatermarkInterval(interval) sets how often watermarks are emitted to downstream operators so that they can advance their event time.

If the watermark has not changed during the specified interval (no events arrived), will the runtime skip emitting a watermark? On the other hand, if a new event arrives before the next interval, is a new watermark emitted immediately, or is it queued until the next setAutoWatermarkInterval interval is reached?

I am curious about the best configuration for the auto-watermark interval (especially for high-rate sources): the smaller this value, the smaller the lag between processing time and event time, but at the cost of more bandwidth used to send the watermarks. Is that accurate?

On the other hand, if I use env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime), the Flink runtime automatically assigns timestamps and watermarks (timestamps correspond to the time the event entered the Flink dataflow pipeline, i.e. at the source operator). Nevertheless, even with ingestion time we can still define a processing-time timer (in the processElement function) as shown below:

long timer = context.timestamp() + Timeout;
context.timerService().registerProcessingTimeTimer(timer);

where context.timestamp() is the ingestion time set by Flink.

Thank you.


1 Answer


The autoWatermarkInterval only affects watermark generators that pay attention to it. Watermark generators also have an opportunity to generate a watermark as each event is processed.

For those watermark generators that use the autoWatermarkInterval (which is definitely the normal case), they are collecting evidence for what the next watermark should be as a side effect of assigning timestamps for each event. When a timer fires (based on the autoWatermarkInterval), the watermark generator is then asked by the Flink runtime to produce the next watermark. The watermark wasn't waiting somewhere, nor was it queued, but rather it is created on demand, based on information that had been stored by the timestamp assigner -- which is typically the maximum timestamp seen so far in the stream.
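To make the "created on demand" point concrete, here is a minimal plain-Java sketch of that mechanism. It is a simplified model, not Flink's actual classes: the class name PeriodicWatermarkSketch and its two methods are hypothetical, and skipping a watermark that does not advance is an assumption of the model.

```java
/**
 * Simplified model of a periodic watermark generator.
 * NOT a Flink class: all names here are hypothetical.
 */
final class PeriodicWatermarkSketch {
    // Evidence collected while timestamps are assigned: the largest timestamp seen so far.
    private long maxTimestampSeen = Long.MIN_VALUE;
    // Last watermark handed downstream, used to skip watermarks that would not advance.
    private long lastEmittedWatermark = Long.MIN_VALUE;

    /** Called once per event. It only records evidence and emits nothing. */
    void onEvent(long eventTimestamp) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
    }

    /**
     * Called when the interval timer fires. Returns the new watermark, or null
     * when it would not advance (assumption of this model: nothing is sent then).
     */
    Long onPeriodicEmit() {
        if (maxTimestampSeen <= lastEmittedWatermark) {
            return null;
        }
        lastEmittedWatermark = maxTimestampSeen;
        return lastEmittedWatermark;
    }
}
```

In this model, events arriving between two timer firings only update maxTimestampSeen. Nothing is queued, and the watermark value is computed at the moment the timer fires.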

Yes, more frequent watermarks mean more overhead to communicate and process them, and lower latency. You have to decide how to handle this throughput/latency tradeoff based on your application's requirements.
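As a rough back-of-the-envelope illustration of that tradeoff, the sketch below computes an upper bound on watermark traffic and the worst-case added delay for a given interval. The class and method names are hypothetical, and the model assumes one watermark per timer firing, broadcast to every downstream channel of a subtask. Real traffic can be lower if watermarks that do not advance are skipped.

```java
/** Hypothetical helper for estimating the cost of a watermark interval. Not a Flink API. */
final class WatermarkIntervalTradeoff {

    /**
     * Upper bound on watermarks sent per second by one subtask:
     * one per timer firing, copied to each downstream channel.
     * Uses integer division, so intervals above 1000 ms yield 0.
     */
    static long maxWatermarksPerSecond(long intervalMillis, int downstreamChannels) {
        return (1000L / intervalMillis) * downstreamChannels;
    }

    /**
     * Worst-case extra wait, in milliseconds, before a timestamp that has
     * already been observed is reflected in an emitted watermark.
     */
    static long worstCaseAddedDelayMillis(long intervalMillis) {
        return intervalMillis;
    }
}
```

For example, under these assumptions a 200 ms interval with 4 downstream channels gives at most 20 watermarks per second and up to 200 ms of added delay, while a 10 ms interval gives at most 400 per second and up to 10 ms of added delay.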

You can always use processing time timers, regardless of the TimeCharacteristic. (By the way, at a low level, the only thing watermarks do is to trigger event time timers, be they in process functions, windows, etc.)