I am new to Flink. I am replacing the Kafka Streams API with Flink, because Kafka Streams creates multiple internal topics under the hood, which adds overhead.
However, in the Flink job, all I am doing is
- Dedupe the records within a given window (1 hour); the assumed timestamp/watermark setup for this event-time window is sketched after the snippet below.
  Operator in the execution plan: (Window(TumblingEventTimeWindows(3600000), EventTimeTrigger, Job$$Lambda$1097/1241473750, PassThroughWindowFunction))
    deDupedStream = deserializedStream
            .keyBy(msg -> new StringBuilder()
                    .append("XXX")
                    .append("YYY"))
            .timeWindow(Time.milliseconds(3600000)) // 1 hour tumbling event-time window
            .reduce((event1, event2) -> {
                // collapse duplicates, keeping the later of the two event timestamps
                event2.setEventTimeStamp(Math.max(event1.getEventTimeStamp(), event2.getEventTimeStamp()));
                return event2;
            })
            .setParallelism(mapParallelism > 0 ? mapParallelism : defaultMapParallelism);
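For context, the event-time windows above only fire once the watermark passes the end of the window, so deserializedStream already has timestamps/watermarks assigned upstream. Below is only a rough sketch of that setup, assuming the Flink 1.11+ WatermarkStrategy API; rawStream, the Event type, and the 5-minute out-of-orderness bound are illustrative placeholders, not my exact code.

    // Requires: org.apache.flink.api.common.eventtime.WatermarkStrategy,
    //           org.apache.flink.streaming.api.datastream.DataStream, java.time.Duration
    // Illustrative sketch only -- rawStream, Event, and the 5-minute bound are assumptions.
    DataStream<Event> deserializedStream = rawStream
            .assignTimestampsAndWatermarks(
                    WatermarkStrategy
                            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                            .withTimestampAssigner((event, recordTs) -> event.getEventTimeStamp()));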
- After deduping, I apply another level of windowing and count the records before producing to the Kafka topic (the sink is sketched after the snippet below).
  Operator in the execution plan: (Window(TumblingEventTimeWindows(3600000), EventTimeTrigger, Job$$Lambda$1101/2132463744, PassThroughWindowFunction) -> Map)
    SingleOutputStreamOperator<XXXObject> countedStream = deDupedStream
            .filter(event -> event.getXXX() != null)
            .map(this::buildXXXObject)
            .returns(XXXObject.class)
            .setParallelism(deDupMapParallelism > 0 ? deDupMapParallelism : defaultDeDupMapParallelism)
            .keyBy(itemInterimMsg -> String.valueOf("key1") + "key2" + "key3")
            .timeWindow(Time.milliseconds(3600000)) // 1 hour tumbling event-time window
            .reduce((existingMsg, currentMsg) -> {
                // aggregate: accumulate the running count per key within the window
                currentMsg.setCount(existingMsg.getCount() + currentMsg.getCount());
                return currentMsg;
            })
            .setParallelism(deDupMapParallelism > 0 ? deDupMapParallelism : defaultDeDupMapParallelism);

    countedStream.addSink(kafkaProducerSinkFunction);
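For completeness, kafkaProducerSinkFunction is a plain Kafka producer sink. A minimal sketch, assuming a FlinkKafkaProducer; the topic name, serialization schema, and producer properties below are placeholders, not my real values.

    // Minimal sketch of the sink referenced above -- topic, schema, and properties are placeholders.
    Properties kafkaProps = new Properties();
    kafkaProps.setProperty("bootstrap.servers", "broker:9092");  // placeholder brokers
    FlinkKafkaProducer<XXXObject> kafkaProducerSinkFunction =
            new FlinkKafkaProducer<>(
                    "output-topic",                          // placeholder topic name
                    new XXXObjectSerializationSchema(),      // placeholder SerializationSchema<XXXObject>
                    kafkaProps);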
With the above setup, my assumption is that the destination Kafka topic will receive the aggregated results every 3600000 ms (1 hour). However, the Grafana graph shows results being emitted roughly every 30 minutes. I do not understand why, since the window is still 1 hour. Any suggestions?