1
votes

I am trying to upgrade our job from flink 1.4.2 to 1.7.1 but I keep running into timeouts after submitting the job. The flink job runs on our hadoop cluster (version 2.7) with Yarn.

I've seen the following behavior:

When the timeout happens I get the following stacktrace:

INFO    class java.time.Instant does not contain a getter for field seconds
INFO    class com.bol.fin_hdp.cm1.domain.Cm1Transportable does not contain a getter for field globalId
INFO    Submitting job 5af931bcef395a78b5af2b97e92dcffe (detached: false).
INFO    ------------------------------------------------------------
INFO    The program finished with the following exception:
INFO    org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
INFO    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:545)
INFO    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
INFO    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
INFO    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:798)
INFO    at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:289)
INFO    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
INFO    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035)
INFO    at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111)
INFO    at java.security.AccessController.doPrivileged(Native Method)
INFO    at javax.security.auth.Subject.doAs(Subject.java:422)
INFO    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
INFO    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
INFO    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111)
INFO    Caused by: java.lang.RuntimeException: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO    at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:43)
INFO    at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJobWithConfig(IntervalJobStarter.java:32)
INFO    at com.bol.fin_hdp.Main.main(Main.java:8)
INFO    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO    at java.lang.reflect.Method.invoke(Method.java:498)
INFO    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
INFO    ... 12 more
INFO    Caused by: org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
INFO    at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:258)
INFO    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
INFO    at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
INFO    at com.bol.fin_hdp.cm1.job.Job.execute(Job.java:54)
INFO    at com.bol.fin_hdp.job.starter.IntervalJobStarter.startJob(IntervalJobStarter.java:41)
INFO    ... 19 more
INFO    Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
INFO    at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:371)
INFO    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
INFO    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
INFO    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:216)
INFO    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
INFO    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
INFO    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
INFO    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
INFO    at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:301)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
INFO    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
INFO    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
INFO    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
INFO    at java.lang.Thread.run(Thread.java:748)
INFO    Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
INFO    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
INFO    ... 17 more
INFO    Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-01...
INFO    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
INFO    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
INFO    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
INFO    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
INFO    ... 15 more
INFO    Caused by: org.apache.flink.shaded.netty4.io.netty.channel.ConnectTimeoutException: connection timed out: shd-hdp-b-slave-017.example.com/some.ip.address:46500
INFO    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
INFO    ... 7 more

What changed in flip-6 that might cause this behavior and how can I fix this?

2

2 Answers

2
votes

For our jobs on YARN w/Flink 1.6, we had to bump up the web.timeout setting via -yD web.timeout=100000.

1
votes

In our case, there was a firewall between the machine submitting the job and our Hadoop cluster.

In newer Flink versions (1.7 and up) Flink uses REST to submit jobs. The port number for this REST service is random on yarn setups and could not be set.

Flink 1.8.0 introduced a config option to set this to a port or port range using:

rest.bind-port: 55520-55530