1 vote

I have a Flink job with large state in a Map operator. We take savepoints of around 80 GB, stored on AWS S3, and this operator runs with a parallelism of about 100. However, when we recover from the savepoint, we always hit an exception like:

Caused by: java.io.InterruptedIOException: Failed to open s3a://xxxx/UbfK/flink/savepoints/logjoin/savepoint-a9600e-39fd2cc07076/f9a31490-461a-42f3-88be-ec169145c35f at 0 on s3a://adshonor-data-cube-test-apse1/UbfK/flink/savepoints/logjoin/savepoint-a9600e-39fd2cc07076/f9a31490-461a-42f3-88be-ec169145c35f: org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool.

Is there a configuration parameter in Flink to increase the timeout settings for AWS S3, or another way to avoid this error?

We tried modifying fs.s3a.connection.maximum, but it didn't work; the timeout error still happened. We finally worked around the problem by reducing each TaskManager's slot count. Previously we had a few big TaskManagers with 30-40 slots each. By downsizing the machines, reducing the slot count to 8, and increasing the number of TaskManagers, we finally eliminated the timeout errors. - Hu Guang
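The workaround in that comment is consistent with each slot opening its own S3 read during restore, so demand on the HTTP connection pool scales with slots per TaskManager. A rough back-of-envelope sketch (the one-stream-per-slot assumption and the pool size of 15, the default for fs.s3a.connection.maximum in older Hadoop releases, are both illustrative):

```python
# Illustrative arithmetic only: estimates worst-case concurrent S3 reads
# issued by one TaskManager while restoring state, and compares that to
# an assumed S3A connection pool size.

S3A_DEFAULT_POOL = 15  # older Hadoop default for fs.s3a.connection.maximum


def peak_s3_connections(slots_per_tm: int, streams_per_slot: int = 1) -> int:
    """Worst-case concurrent S3 reads from one TaskManager during restore."""
    return slots_per_tm * streams_per_slot


for slots in (8, 30, 40):
    demand = peak_s3_connections(slots)
    print(f"{slots} slots -> {demand} connections, "
          f"exceeds pool: {demand > S3A_DEFAULT_POOL}")
```

With 30-40 slots the estimated demand is well above a 15-connection pool, while 8 slots fits, which matches the commenter's experience.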

3 Answers

1 vote

Try setting fs.s3a.connection.maximum to something like 50 or 100.
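If you use Flink's Hadoop-based S3 filesystem (flink-s3-fs-hadoop), fs.s3a.* keys placed in flink-conf.yaml are forwarded to the shaded S3A client. A minimal sketch (the value is illustrative; tune it to your parallelism):

```yaml
# flink-conf.yaml fragment -- illustrative value, adjust as needed
fs.s3a.connection.maximum: 100
```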

1 vote

To elaborate a bit on what Steve said: it's likely that the HTTP client being used doesn't have a large enough connection pool.

Each S3A client interacting with a single bucket, as a single user, has its own dedicated pool of open HTTP 1.1 connections alongside a pool of threads used for upload and copy operations. The default pool sizes are intended to strike a balance between performance and memory/thread use.

For a good overview of the settings you can tune (including fs.s3a.connection.maximum), see the "Options to Tune" section of this Hadoop page.
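If you configure Hadoop directly rather than through Flink, the same properties go in core-site.xml. A sketch using property names from that Hadoop tuning page (the values here are assumptions to adjust, not recommendations):

```xml
<!-- core-site.xml fragment; illustrative values -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>100</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>64</value>
</property>
```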

0 votes

If Flink runs on AWS EMR and reads S3 through AWS's own connection code (EMRFS) rather than Hadoop's S3A, the setting to bump is fs.s3.maxConnections, which is not the same as the pure-Hadoop fs.s3a.* configuration.

When running on AWS EMR, you can refer to this document: https://aws.amazon.com/cn/premiumsupport/knowledge-center/emr-timeout-connection-wait/
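On EMR, such settings are typically applied through a cluster configuration classification. A sketch along the lines of what the linked article describes (the value is illustrative):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "100"
    }
  }
]
```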