I am building a streaming app using Flink 1.3.2 with scala, my Flink app will monitor a folder and stream new files into pipeline. Each record in the file has a timestamp associated. I want to use this timestamp as the event time and build watermark using AssignerWithPeriodicWatermarks[T], my watermark generator looks like below:
class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[Activity] {
val maxTimeLag = 6 * 3600000L // 6 hours
override def extractTimestamp(element: Activity, previousElementTimestamp: Long): Long = {
val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")
val timestampString = element.getTimestamp
}
override def getCurrentWatermark(): Watermark = {
new Watermark(System.currentTimeMillis() - maxTimeLag)
}
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(10000L)
val stream = env.readFile(inputformart, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
val activity = stream
.assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator())
.map { line =>
new tuple.Tuple2(line.id, line.count)
}.keyBy(0).addSink(...)
However, since my folder has some old data there, I don't want to process them. And the timestamp of records in older file are > 6 hours, which should be older than watermark. However, when I start running it, I can still see some initial output been created. I was wondering how the initial value of watermark been set up, is it the before the first interval or after? It might be I misunderstand something here but need some advice.