0
votes

I am using Flink 1.11 and run into a timeout when taking a savepoint.

[Screenshot: timeout exception]

My savepoint is a little over 4 GB.
How can I increase the savepoint timeout?

Thanks

2 Answers

0
votes

Please refer to the "Enabling and Configuring Checkpointing" section of the Flink documentation.

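Savepoints are subject to the checkpoint timeout, which you can set as shown below. Note that the example value from the documentation is one minute, while the default is 10 minutes, so to actually increase the timeout you need to pass a value larger than 600000 ms: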

// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);

I would also recommend increasing the minimum pause between checkpoints, so that the streaming application makes a certain amount of progress between checkpoints:

// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);

0
votes

I had the same issue, as you can see from my very similar logs:

org.apache.flink.util.FlinkException: Triggering a savepoint for the job 63a70a46cf5bffda3ca0a1e791113122 failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:754)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)

Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:771)
... 10 more

The cause was not that the savepoint was timing out, but rather that the client communication was timing out.
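
As a rough illustration of that distinction, here is a sketch that uses only the JDK and no Flink classes. The hypothetical `startSlowOperation` stands in for the savepoint running on the cluster, and `clientTimedOut` mirrors the `CompletableFuture.get` call with a timeout that appears in the stack trace above. All names and durations are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientTimeoutSketch {

    // Hypothetical stand-in for a long-running savepoint: completes after workMillis.
    static CompletableFuture<String> startSlowOperation(long workMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "savepoint-done";
        });
    }

    // Returns true if the caller gave up waiting before the operation finished.
    static boolean clientTimedOut(CompletableFuture<String> op, long waitMillis) {
        try {
            op.get(waitMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            // Only the wait expired; the operation itself was not cancelled.
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> op = startSlowOperation(500);
        System.out.println("client timed out: " + clientTimedOut(op, 50));
        // Waiting without a deadline shows the work still finishes.
        System.out.println("result after waiting: " + op.join());
    }
}
```

In this sketch the caller gives up after 50 ms, yet the operation still completes later. Whether a real savepoint finishes after the CLI has given up depends on the cluster, so this only illustrates the client side of the failure.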

For me, this was happening on EMR. Adding the following line to /etc/flink/conf.dist/flink-conf.yaml on the master node, which raises the client timeout to 5 minutes (the value is in milliseconds), did the trick:

akka.client.timeout: 300000

For additional context, the savepoint I was working with was 160.3 GiB, pulled from 4 instances.