1
votes

What I understand so far is that there 3 ways of dealing with late data in Flink :

  • Dropping Late Events ( which is the default behavior for event-time window operators. Hence, a late arriving element will not create a new window.)

  • Redirecting Late Events (Late events can also be redirected into another DataStream using the side-output feature)

  • Updating Results by Including Late Events (recompute an incomplete result and emit an update)

I don't quite well understand what happen for late event for non-window operator, especially when the timestamp is assigned at the source. Here I have a FlinkKafkaConsumer :

new FlinkKafkaConsumer(
          liveTopic,
          deserializer,
          config.toProps
        ).setStartFromTimestamp(startOffsetTimestamp)
          .assignTimestampsAndWatermarks(
            WatermarkStrategy
              .forBoundedOutOfOrderness[String](Duration.ofSeconds(20))
          )

If some data is out-of-order inside my Kafka partition, let's say 1 minute late in term of timestamp attached to a record, will this data be discarded when consumed by Flink ? Can I configure some kind of allowedLatness (like with window operator) ?

1

1 Answers

2
votes

The only operators that drop late events are those that must make a time-based decision about how to process each event. Thus, by default, event-time-based windows and CEP drop late events (CEP does so because it has to first do a time-based sort of the event stream, and late events have missed their chance to be sorted into place). In both of those cases, those APIs offer a stream of late events as a side output channel.

Flink SQL's temporal operators also drop late events. So far the Table/SQL doesn't offer any way to capture or accommodate those late events (without resorting to using the DataStream API).

But all other operators else simply operate on events without paying any attention to their lateness. And in a ProcessFunction you can examine the timestamp and compare it to the current watermark, and make your own decision about how to handle late events.