I want to keep state about 2TB in Flink using the Rocksdb state backend. I will use the incremental checkpoint, thus it will reduce the checkpoint time dramatically.
But I have to change code sometimes, e.g re-scaling, bug fix, adding new filter/mapping, adding new sources/sinks etc.
All of them can affect the job topology. I can bootstrap state again when any changes on state. But other times, bootstrap state could be difficult because that means time waste for me.
In these cases, I have to take a savepoint to restart my job. I also take savepoint periodically while job is running to restart job from the latest savepoint when the job is failed (e.g every 15 minutes). But the time while taking savepoint will be too long due to large state. MTTR (mean time to recovery) is very important for me. How can i improve savepoint performance?