0
votes

it seems like Apache Flink would not handle well two events with the same timestamp in certain scenarios.

According to the docs a Watermark of t indicates any new events will have a timestamp strictly greater than t. Unless you can completely discard the possibility of two events having the same timestamp then you will not be safe to ever emit a Watermark of t. Enforcing distinct timestamps also limits the number of events per second a system can process to 1000.

Is this really an issue in Apache Flink or is there a workaround?

For those of you that'd like a concrete example to play with, my use case is to build a hourly aggregated rolling word count for an event time ordered stream. For the data sample that I copied in a file (notice the duplicate 9):

mario 0
luigi 1
mario 2
mario 3
vilma 4
fred 5
bob 6
bob 7
mario 8
dan 9
dylan 9
dylan 11
fred 12
mario 13
mario 14
carl 15
bambam 16
summer 17
anna 18
anna 19
edu 20
anna 21
anna 22
anna 23
anna 24
anna 25

And the code:

public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
            .setParallelism(1)
            .setMaxParallelism(1);

    env.setStreamTimeCharacteristic(EventTime);


    String fileLocation = "full file path here";
    DataStreamSource<String> rawInput = env.readFile(new TextInputFormat(new Path(fileLocation)), fileLocation);

    rawInput.flatMap(parse())
            .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<TimestampedWord>() {
                @Nullable
                @Override
                public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
                    return new Watermark(extractedTimestamp);
                }

                @Override
                public long extractTimestamp(TimestampedWord element, long previousElementTimestamp) {
                    return element.getTimestamp();
                }
            })
            .keyBy(TimestampedWord::getWord)
            .process(new KeyedProcessFunction<String, TimestampedWord, Tuple3<String, Long, Long>>() {
                private transient ValueState<Long> count;

                @Override
                public void open(Configuration parameters) throws Exception {
                    count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
                }

                @Override
                public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                    if (count.value() == null) {
                        count.update(0L);
                        setTimer(ctx.timerService(), value.getTimestamp());
                    }

                    count.update(count.value() + 1);
                }

                @Override
                public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                    long currentWatermark = ctx.timerService().currentWatermark();
                    out.collect(new Tuple3(ctx.getCurrentKey(), count.value(), currentWatermark));
                    if (currentWatermark < Long.MAX_VALUE) {
                        setTimer(ctx.timerService(), currentWatermark);
                    }
                }

                private void setTimer(TimerService service, long t) {
                    service.registerEventTimeTimer(((t / 10) + 1) * 10);
                }
            })
            .addSink(new PrintlnSink());

    env.execute();
}

private static FlatMapFunction<String, TimestampedWord> parse() {
    return new FlatMapFunction<String, TimestampedWord>() {
        @Override
        public void flatMap(String value, Collector<TimestampedWord> out) {
            String[] wordsAndTimes = value.split(" ");
            out.collect(new TimestampedWord(wordsAndTimes[0], Long.parseLong(wordsAndTimes[1])));
        }
    };
}

private static class TimestampedWord {
    private final String word;
    private final long timestamp;

    private TimestampedWord(String word, long timestamp) {
        this.word = word;
        this.timestamp = timestamp;
    }

    public String getWord() {
        return word;
    }

    public long getTimestamp() {
        return timestamp;
    }
}

private static class PrintlnSink implements org.apache.flink.streaming.api.functions.sink.SinkFunction<Tuple3<String, Long, Long>> {
    @Override
    public void invoke(Tuple3<String, Long, Long> value, Context context) throws Exception {
        long timestamp = value.getField(2);
        System.out.println(value.getField(0) + "=" + value.getField(1) + " at " + (timestamp - 10) + "-" + (timestamp - 1));
    }
}

I get

    mario=4 at 1-10
    dylan=2 at 1-10
    luigi=1 at 1-10
    fred=1 at 1-10
    bob=2 at 1-10
    vilma=1 at 1-10
    dan=1 at 1-10
    vilma=1 at 10-19
    luigi=1 at 10-19
    mario=6 at 10-19
    carl=1 at 10-19
    bambam=1 at 10-19
    dylan=2 at 10-19
    summer=1 at 10-19
    anna=2 at 10-19
    bob=2 at 10-19
    fred=2 at 10-19
    dan=1 at 10-19
    fred=2 at 9223372036854775797-9223372036854775806
    dan=1 at 9223372036854775797-9223372036854775806
    carl=1 at 9223372036854775797-9223372036854775806
    mario=6 at 9223372036854775797-9223372036854775806
    vilma=1 at 9223372036854775797-9223372036854775806
    edu=1 at 9223372036854775797-9223372036854775806
    anna=7 at 9223372036854775797-9223372036854775806
    summer=1 at 9223372036854775797-9223372036854775806
    bambam=1 at 9223372036854775797-9223372036854775806
    luigi=1 at 9223372036854775797-9223372036854775806
    bob=2 at 9223372036854775797-9223372036854775806
    dylan=2 at 9223372036854775797-9223372036854775806

Notice dylan=2 at 0-9 where it should be 1.

1

1 Answers

2
votes

No, there isn't a problem with having stream elements with the same timestamp. But a Watermark is an assertion that all events that follow will have timestamps greater than the watermark, so this does mean that you cannot safely emit a Watermark t for a stream element at time t, unless the timestamps in the stream are strictly monotonically increasing -- which is not the case if there are multiple events with the same timestamp. This is why the AscendingTimestampExtractor produces watermarks equal to currentTimestamp - 1, and you should do the same.

Notice that your application is actually reporting that dylan=2 at 0-10, not at 0-9. This is because the watermark resulting from dylan at time 11 is triggering the first timer (the timer set for time 10, but since there is no element with a timestamp of 10, that timer doesn't fire until the watermark from "dylan 11" arrives). And your PrintlnSink uses timestamp - 1 to indicate the upper end of the timespan, hence 11 - 1, or 10, rather than 9.

There's nothing wrong with the output of your ProcessFunction, which looks like this:

(mario,4,11)
(dylan,2,11)
(luigi,1,11)
(fred,1,11)
(bob,2,11)
(vilma,1,11)
(dan,1,11)
(vilma,1,20)
(luigi,1,20)
(mario,6,20)
(carl,1,20)
(bambam,1,20)
(dylan,2,20)
...

It is true that by time 11 there have been two dylans. But the report produced by PrintlnSink is misleading.

Two things need to be changed to get your example working as intended. First, the watermarks need to satisfy the watermarking contract, which isn't currently the case, and second, the windowing logic isn't quite right. The ProcessFunction needs to be prepared for the "dylan 11" event to arrive before the timer closing the window for 0-9 has fired. This is because the "dylan 11" stream element precedes the watermark generated from it in the stream.

Update: events whose timestamps are beyond the current window (such as "dylan 11") can be handled by

  1. keep track of when the current window ends
  2. rather than incrementing the counter, add events for times after the current window to a list
  3. after a window ends, consume events from that list that fall into the next window