I have a Flink cluster that checkpoints to S3. Every minute a snapshot of the current state is written to S3, which takes about 20 seconds. The snapshot uses all of the available network bandwidth (1 Gb/s), so my job incurs about 20 seconds of latency every minute.

Is there a way to limit the bandwidth used by checkpointing, to keep checkpoints from saturating the network, or is there some other solution?

Thanks.

1 Answer

If you are not already doing so, I recommend you look into using incremental checkpointing (with RocksDB). This feature was added in Flink 1.3, and has proven to be very helpful for Flink applications with large state.

Incremental checkpointing is turned off by default. To enable it, pass true as the second argument to the RocksDBStateBackend constructor, like this:

RocksDBStateBackend backend =
    new RocksDBStateBackend(filebackend, true);

or set state.backend.incremental to true in your Flink configuration file (flink-conf.yaml).

This won't directly address what you asked -- how to throttle the checkpointing so it doesn't saturate the network -- but should help, nonetheless.
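To get a rough feel for why this helps, here is a back-of-the-envelope sketch using the numbers from the question. It assumes the 20-second upload really does saturate the 1 Gb/s link for the whole time, which puts the full snapshot at roughly 2.5 GB. The 10% figure for how much state changes between checkpoints is purely hypothetical, as are the class and method names; in practice incremental checkpoints upload newly created RocksDB files rather than an exact delta, so treat the result as an order-of-magnitude estimate only.

```java
public class CheckpointUploadEstimate {

    // Data volume (in gigabytes) moved over a link of the given speed
    // (in gigabits per second) during the given number of seconds.
    static double stateSizeGigabytes(double linkGbitPerSec, double uploadSeconds) {
        return linkGbitPerSec * uploadSeconds / 8.0;
    }

    // Time (in seconds) needed to move the given volume (in gigabytes)
    // over a link of the given speed (in gigabits per second).
    static double uploadSeconds(double sizeGigabytes, double linkGbitPerSec) {
        return sizeGigabytes * 8.0 / linkGbitPerSec;
    }

    public static void main(String[] args) {
        double linkGbitPerSec = 1.0;    // from the question
        double fullUploadSeconds = 20.0; // from the question

        // Assumption: the link is fully saturated for the whole upload.
        double fullSnapshotGb = stateSizeGigabytes(linkGbitPerSec, fullUploadSeconds);

        // Hypothetical: only 10% of the state changes between checkpoints.
        double changedFraction = 0.10;
        double incrementalSeconds =
                uploadSeconds(fullSnapshotGb * changedFraction, linkGbitPerSec);

        System.out.printf("Estimated full snapshot size: %.2f GB%n", fullSnapshotGb);
        System.out.printf("Estimated incremental upload time: %.2f s%n", incrementalSeconds);
    }
}
```

The actual saving depends entirely on how much of your state changes between checkpoints, so measure the checkpoint sizes reported in the Flink web UI before and after enabling the feature.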

Also, note that Amazon recommends using the Elastic Network Adapter in applications that make heavy use of S3. This provides up to 25 Gbps of bandwidth.

For more on working with large state in Flink, you might want to look at