
I am trying to move from Flink 1.3.2 to 1.5. Our cluster is deployed on Kubernetes. Everything works fine with 1.3.2, but I cannot submit a job with 1.5. When I try, the spinner just spins indefinitely, and the same happens via the REST API. I cannot even submit the WordCount example job. It seems my taskmanagers cannot connect to the jobmanager: I can see them in the Flink UI, but in the logs I see:

level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123

level=WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]

level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123

However, I can telnet from the taskmanager to the jobmanager.

Moreover, everything works locally if I start Flink in cluster mode (jobmanager + taskmanager). In the 1.5 documentation I found the mode option, which switches between flip6 and legacy (default: flip6), but if I set mode: legacy, my taskmanagers do not register at all.

Is there something specific to a k8s deployment with 1.5 that I need to do? I checked the 1.5 k8s config and it looks much the same as ours, but we are using a customized Docker image for Flink (security, HA, checkpointing).

Thank you.

I think you should check the consistency of your dependencies one more time! - Soheil Pourbafrani

The jobs were rebuilt with the Flink 1.5.0 dependencies mentioned at flink.apache.org/downloads.html. This is what we put in the lib folder: aws-java-sdk-1.7.4.jar, flink-dist_2.11-1.5.0.jar, flink-metrics-datadog-1.5.0.jar, flink-python_2.11-1.5.0.jar, flink-shaded-hadoop2-uber-1.5.0.jar, hadoop-aws-2.7.2.jar, httpclient-4.5.3.jar, httpcore-4.4.4.jar, jackson-annotations-2.6.7.jar, jackson-core-2.6.7.jar, jackson-databind-2.6.7.jar, joda-time-2.8.1.jar, log4j-1.2.17.jar, slf4j-log4j12-1.7.7.jar - Georgy Gobozov

Could you share the full client and cluster entrypoint logs with us, @GeorgyGobozov? It would also be helpful to see your K8s deployment and service definitions. To submit a job with the client, you need to expose the REST endpoint port (8081) and the blob server port as a NodePort. If you only want to use the web UI, it should be enough to expose these ports as ClusterIP. - Till Rohrmann

@TillRohrmann I am only trying to submit the job from the web UI, at least for now. Here are my k8s configs: pastebin.com/4W4KmvfR pastebin.com/1Rvd87Cc pastebin.com/Jd8mRXAH I switched to the flink:latest images but still have the issue with job submission. When I try to submit WordCount, I get this on the jobmanager: "Could not connect to BlobServer at address flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.250.57:6124" "Caused by: java.net.ConnectException: Connection timed out" - Georgy Gobozov

Could you check whether flink-jobmanager-nonprod-2 is reachable from the node on which the JobManager is deployed? There are some known problems of this kind with K8s: github.com/kubernetes/kubernetes/issues/20475, github.com/kubernetes/kubernetes/issues/19930 and github.com/kubernetes/kubernetes/issues/20391 - Till Rohrmann

1 Answer


The issue is with jobmanager connectivity: the jobmanager Docker container cannot connect to the "flink-jobmanager" (${JOB_MANAGER_RPC_ADDRESS}) address.
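
If you want to confirm this from inside the jobmanager container (or from a taskmanager), a small TCP check can stand in for telnet. This is only a rough sketch: it assumes Python 3 is available in the container, that JOB_MANAGER_RPC_ADDRESS is set in the environment (falling back to "flink-jobmanager"), and it uses the ports mentioned in this thread (6123 RPC, 6124 blob server, 8081 REST/web UI). The helper name can_connect is mine, not something from Flink.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections and timeouts.
        return False


if __name__ == "__main__":
    # Assumption: the service name is passed in the same way the Docker image does it.
    host = os.environ.get("JOB_MANAGER_RPC_ADDRESS", "flink-jobmanager")
    # Ports taken from the logs and comments above; adjust if yours differ.
    for name, port in (("rpc", 6123), ("blob", 6124), ("rest/ui", 8081)):
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{name:8s} {host}:{port} {status}")
```

If the check succeeds from a taskmanager but fails when run inside the jobmanager container itself, that matches what I saw: the jobmanager cannot reach its own service address.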

Just use the afilichkin/flink-k8s Docker image instead of flink:latest.

I fixed it by adding a new host to the jobmanager Docker image. You can see it in my GitHub project:

https://github.com/Aleksandr-Filichkin/flink-k8s/tree/master