Context: the project I'm working on processes timestamped files that are produced periodically (1 min) and they are ingested in real time into a series of cascading window operators. The timestamp of the file indicates the event time, so I don't need to rely on the file creation time. The result of the processing of each window is sent to a sink which stores the data in several tables.
input -> 1 min -> 5 min -> 15 min -> ...
\-> SQL \-> SQL \-> SQL
I am trying to come up with a solution to deal with possible downtime of the real time process. The input files are generated independently, so in case of severe downtime of the Flink solution, I want to ingest and process the missed files as if they were ingested by the same process.
My first thought is to configure a mode of operation of the same flow which reads only the missed files and has an allowed lateness which covers the earliest file to be processed. However, once a file has been processed, it is guaranteed that no more late files will be ingested, so I don't necessarily need to maintain the earliest window open for the duration of the whole process, especially since there may be many files to process in this manner. Is it possible to do something about closing windows, even with the allowed lateness set, or maybe I should look into reading the whole thing as a batch operation and partition by timestamp instead?