0
votes

We are deploying a new Flink stream processing job and it's state (stores) need to be initialized with historical data and this data should be available in the state store before it starts processing any new application events. We don't want to significantly modify the Flink job to also load the historical data. We considered writing another, separate Flink job to process the historical data, update it's state store and create a Savepoint and use this Savepoint to initialize the state in the main Flink job. Looks like State Processor API only works with DataSet API and wondering about any alternative solutions. Thanks.

2

2 Answers

1
votes

The State Processor API is a good solution. It provides a sort of savepoint connector that you use in a DataSet job to read/modify/update the savepoints that you use in your DataStream jobs.

1
votes

It's a pretty simple change (definitely not "significant") to support a -preload mode for your job, where the non-historical data sources get replaced by empty/non-terminating sources. I typically use counters to decide when state has been fully populated, then stop with a savepoint, and restart without the -preload option.