I'm trying to use Flink to consume a bounded data from a message queue in a streaming passion. The data will be in the following format:
{"id":-1,"name":"Start"}
{"id":1,"name":"Foo 1"}
{"id":2,"name":"Foo 2"}
{"id":3,"name":"Foo 3"}
{"id":4,"name":"Foo 4"}
{"id":5,"name":"Foo 5"}
...
{"id":-2,"name":"End"}
The start and end of messages can be determined using the event id. I want to receive such batches and store the latest (by overwriting) batch on disk or in memory. I can write a custom window trigger to extract the events using the start and end flags as shown below:
DataStream<Foo> fooDataStream = ...
AllWindowedStream<Foo, GlobalWindow> fooWindow = fooDataStream.windowAll(GlobalWindows.create())
.trigger(new CustomTrigger<>())
.evictor(new Evictor<Foo, GlobalWindow>() {
@Override
public void evictBefore(Iterable<TimestampedValue<Foo>> elements, int size, GlobalWindow window, EvictorContext evictorContext) {
for (Iterator<TimestampedValue<Foo>> iterator = elements.iterator();
iterator.hasNext(); ) {
TimestampedValue<Foo> foo = iterator.next();
if (foo.getValue().getId() < 0) {
iterator.remove();
}
}
}
@Override
public void evictAfter(Iterable<TimestampedValue<Foo>> elements, int size, GlobalWindow window, EvictorContext evictorContext) {
}
});
but how can I persist the output of the latest window. One way would be using a ProcessAllWindowFunction
to receive all the events and write them to disk manually but it feels like a hack. I'm also looking into the Table API with Flink CEP Pattern (like this question) but couldn't find a way to clear the Table after each batch to discard the events from the previous batch.