it seems like Apache Flink would not handle well two events with the same timestamp in certain scenarios.
According to the docs a Watermark of t
indicates any new events will have a timestamp strictly greater than t
. Unless you can completely discard the possibility of two events having the same timestamp then you will not be safe to ever emit a Watermark of t
. Enforcing distinct timestamps also limits the number of events per second a system can process to 1000.
Is this really an issue in Apache Flink or is there a workaround?
For those of you that'd like a concrete example to play with, my use case is to build a hourly aggregated rolling word count for an event time ordered stream. For the data sample that I copied in a file (notice the duplicate 9):
mario 0
luigi 1
mario 2
mario 3
vilma 4
fred 5
bob 6
bob 7
mario 8
dan 9
dylan 9
dylan 11
fred 12
mario 13
mario 14
carl 15
bambam 16
summer 17
anna 18
anna 19
edu 20
anna 21
anna 22
anna 23
anna 24
anna 25
And the code:
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
.setParallelism(1)
.setMaxParallelism(1);
env.setStreamTimeCharacteristic(EventTime);
String fileLocation = "full file path here";
DataStreamSource<String> rawInput = env.readFile(new TextInputFormat(new Path(fileLocation)), fileLocation);
rawInput.flatMap(parse())
.assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<TimestampedWord>() {
@Nullable
@Override
public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
return new Watermark(extractedTimestamp);
}
@Override
public long extractTimestamp(TimestampedWord element, long previousElementTimestamp) {
return element.getTimestamp();
}
})
.keyBy(TimestampedWord::getWord)
.process(new KeyedProcessFunction<String, TimestampedWord, Tuple3<String, Long, Long>>() {
private transient ValueState<Long> count;
@Override
public void open(Configuration parameters) throws Exception {
count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
}
@Override
public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
if (count.value() == null) {
count.update(0L);
setTimer(ctx.timerService(), value.getTimestamp());
}
count.update(count.value() + 1);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
long currentWatermark = ctx.timerService().currentWatermark();
out.collect(new Tuple3(ctx.getCurrentKey(), count.value(), currentWatermark));
if (currentWatermark < Long.MAX_VALUE) {
setTimer(ctx.timerService(), currentWatermark);
}
}
private void setTimer(TimerService service, long t) {
service.registerEventTimeTimer(((t / 10) + 1) * 10);
}
})
.addSink(new PrintlnSink());
env.execute();
}
private static FlatMapFunction<String, TimestampedWord> parse() {
return new FlatMapFunction<String, TimestampedWord>() {
@Override
public void flatMap(String value, Collector<TimestampedWord> out) {
String[] wordsAndTimes = value.split(" ");
out.collect(new TimestampedWord(wordsAndTimes[0], Long.parseLong(wordsAndTimes[1])));
}
};
}
private static class TimestampedWord {
private final String word;
private final long timestamp;
private TimestampedWord(String word, long timestamp) {
this.word = word;
this.timestamp = timestamp;
}
public String getWord() {
return word;
}
public long getTimestamp() {
return timestamp;
}
}
private static class PrintlnSink implements org.apache.flink.streaming.api.functions.sink.SinkFunction<Tuple3<String, Long, Long>> {
@Override
public void invoke(Tuple3<String, Long, Long> value, Context context) throws Exception {
long timestamp = value.getField(2);
System.out.println(value.getField(0) + "=" + value.getField(1) + " at " + (timestamp - 10) + "-" + (timestamp - 1));
}
}
I get
mario=4 at 1-10
dylan=2 at 1-10
luigi=1 at 1-10
fred=1 at 1-10
bob=2 at 1-10
vilma=1 at 1-10
dan=1 at 1-10
vilma=1 at 10-19
luigi=1 at 10-19
mario=6 at 10-19
carl=1 at 10-19
bambam=1 at 10-19
dylan=2 at 10-19
summer=1 at 10-19
anna=2 at 10-19
bob=2 at 10-19
fred=2 at 10-19
dan=1 at 10-19
fred=2 at 9223372036854775797-9223372036854775806
dan=1 at 9223372036854775797-9223372036854775806
carl=1 at 9223372036854775797-9223372036854775806
mario=6 at 9223372036854775797-9223372036854775806
vilma=1 at 9223372036854775797-9223372036854775806
edu=1 at 9223372036854775797-9223372036854775806
anna=7 at 9223372036854775797-9223372036854775806
summer=1 at 9223372036854775797-9223372036854775806
bambam=1 at 9223372036854775797-9223372036854775806
luigi=1 at 9223372036854775797-9223372036854775806
bob=2 at 9223372036854775797-9223372036854775806
dylan=2 at 9223372036854775797-9223372036854775806
Notice dylan=2 at 0-9 where it should be 1.