We are deploying a new Flink stream processing job and it's state (stores) need to be initialized with historical data and this data should be available in the state store before it starts processing any new application events. We don't want to significantly modify the Flink job to also load the historical data. We considered writing another, separate Flink job to process the historical data, update it's state store and create a Savepoint and use this Savepoint to initialize the state in the main Flink job. Looks like State Processor API only works with DataSet API and wondering about any alternative solutions. Thanks.
0
votes
2 Answers
1
votes
1
votes
It's a pretty simple change (definitely not "significant") to support a -preload
mode for your job, where the non-historical data sources get replaced by empty/non-terminating sources. I typically use counters to decide when state has been fully populated, then stop with a savepoint, and restart without the -preload
option.