Flink restore/restart job with huge state

Question

I am running flink cluster over K8 with ~1TB of state.

One of the problems I am facing is taking a savepoint and restoring a job back. Now, these updates are simple code updates at times and not parallelism changes. But the time to take a savepoint and then restoring the new job with the old state is pretty high.

Is there a way to do an in-place update of the job so that the local states and jobid do not change and hence can avoid the time consumed in doing the savepoint+restore?

David Anderson David Anderson · Accepted Answer · 2021-02-24T09:23:13

In many cases you can use retained (externalized) checkpoints instead of savepoints. This works, except in these cases:

rescaling with unaligned checkpoints (this restriction will go away; see FLINK-17979)
there are changes to the job topology involving state
there are changes to the types requiring state migration

You may find that topology changes and state migration work in some cases, but this is not guaranteed.

With large state on RocksDB you will want to use incremental checkpointing. Full checkpoints and savepoints take a long time, but incremental checkpoints are much faster.

If you do want to rescale, it's beneficial to increase the number of threads available to RocksDB, something more like this:

state.backend.rocksdb.checkpoint.transfer.thread.num: 8
state.backend.rocksdb.thread.num: 8

Flink restore/restart job with huge state

1 Answers