
Context: the project I'm working on processes timestamped files that are produced periodically (every minute) and ingested in real time into a series of cascading window operators. The timestamp of the file indicates the event time, so I don't need to rely on the file creation time. The result of processing each window is sent to a sink that stores the data in several tables.

input -> 1 min -> 5 min -> 15 min -> ...
          \-> SQL  \-> SQL  \-> SQL
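
For illustration, here is a minimal sketch of what this cascade looks like in the DataStream API. It is not the real job: the Reading record, the keying, and the sum aggregation are stand-ins, and the SQL sinks are replaced by print() so the snippet stands alone.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CascadeSketch {

    // Stand-in for whatever record the files actually contain.
    public static class Reading {
        public String key;
        public long timestamp;   // event time taken from the file, in ms
        public double value;
        public Reading() {}
        public Reading(String key, long timestamp, double value) {
            this.key = key; this.timestamp = timestamp; this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Reading> input = env
                .fromElements(new Reading("a", 0L, 1.0), new Reading("a", 60_000L, 2.0))
                // Files (and therefore records) arrive in timestamp order, so
                // monotonous watermarks derived from the file timestamp are enough
                // to drive the event-time windows.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Reading>forMonotonousTimestamps()
                                .withTimestampAssigner((r, ts) -> r.timestamp));

        DataStream<Reading> oneMin = sumWindow(input, Time.minutes(1));
        oneMin.print();                    // stands in for the 1-min SQL sink

        DataStream<Reading> fiveMin = sumWindow(oneMin, Time.minutes(5));
        fiveMin.print();                   // stands in for the 5-min SQL sink

        DataStream<Reading> fifteenMin = sumWindow(fiveMin, Time.minutes(15));
        fifteenMin.print();                // stands in for the 15-min SQL sink

        env.execute("cascading windows sketch");
    }

    // One tumbling event-time window stage; each stage re-aggregates the
    // output of the previous one, as in the diagram above.
    private static DataStream<Reading> sumWindow(DataStream<Reading> in, Time size) {
        return in.keyBy(r -> r.key)
                 .window(TumblingEventTimeWindows.of(size))
                 .reduce((a, b) -> new Reading(a.key, Math.max(a.timestamp, b.timestamp), a.value + b.value));
    }
}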

I am trying to come up with a solution to deal with possible downtime of the real-time process. The input files are generated independently, so in case of severe downtime of the Flink job, I want to ingest and process the missed files as if they had been ingested by the same process.

My first thought is to configure a separate mode of operation for the same flow that reads only the missed files and sets an allowed lateness large enough to cover the earliest file to be processed. However, once a file has been processed, it is guaranteed that no more late files will be ingested, so I don't necessarily need to keep the earliest window open for the duration of the whole backfill, especially since there may be many files to process in this manner. Is it possible to close windows earlier even with allowed lateness set, or should I instead read the whole thing as a batch operation and partition by timestamp?
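
To make that first idea concrete, this is roughly the fragment I have in mind, reusing the stand-in Reading type and reduce function from the sketch above; the 24-hour lateness is only a placeholder for "old enough to cover the earliest missed file":

// Backfill variant of one window stage: the same tumbling window, but with an
// allowed lateness large enough to cover the oldest missed file. This keeps
// every window's state around until the lateness expires, which is exactly
// the cost I would like to avoid.
DataStream<Reading> oneMinBackfill = input
        .keyBy(r -> r.key)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .allowedLateness(Time.hours(24))   // placeholder: covers the earliest missed file
        .reduce((a, b) -> new Reading(a.key, Math.max(a.timestamp, b.timestamp), a.value + b.value));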

Two questions: are you using event time processing, with watermarks? Do you ingest the input files in order? – David Anderson
Yes to both: I am using event time, generating watermarks each time a file is ingested. In the current scheme, files can be assumed to be ingested in order, since they are generated and detected in order; and if I ever needed to implement a batch read, I would ingest them in order as well. – Tormenta

1 Answer


Since you are ingesting the input files in order, using event time processing, I don't see why there's an issue. When the Flink job recovers, it seems like it should be able to resume from where it left off.
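
One assumption behind "resume from where it left off" is that the job runs with checkpointing enabled, so a restart picks up from the last completed checkpoint rather than skipping or reprocessing files. A minimal configuration sketch (the interval and checkpoint path are placeholders):

// Assumption: checkpointing is enabled, so after a failure or restart the job
// resumes from the last completed checkpoint instead of losing or redoing work.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);                                              // every 60 s
env.getCheckpointConfig().setCheckpointStorage("file:///flink/checkpoints");  // placeholder path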

If I've misunderstood the situation, and you sometimes need to go back and process (or reprocess) a file from some point in the past, one way to do this would be to deploy another instance of the same job, configured to only ingest the file(s) needing to be (re)ingested. There shouldn't be any need to rewrite this as a batch job -- most streaming jobs can be run on bounded inputs. And with event time processing, this backfill job will produce the same results as if it had been run in (near) real-time.
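
As a sketch of what that backfill instance could look like, here is a bounded FileSource (the newer file connector) pointed only at the missed files; the path, the line format, and the downstream placeholder would be replaced by the same parsing, watermarking, windowing, and SQL sinks as the live job:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackfillSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded source over only the files missed during the outage.
        // Without monitorContinuously() the source is bounded, so the job reads
        // the files, fires all windows, and then terminates on its own.
        FileSource<String> missedFiles = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(),
                        new Path("file:///data/missed/"))            // placeholder path
                .build();

        DataStream<String> lines = env.fromSource(
                missedFiles, WatermarkStrategy.noWatermarks(), "missed files");

        // From here on, reuse exactly the same parsing, timestamp/watermark
        // assignment, cascading windows, and SQL sinks as the live job.
        lines.print();  // placeholder for the real pipeline

        env.execute("backfill of missed files");
    }
}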