4 votes

I am trying to get Flink to run in HA mode using ZooKeeper, but when I test it by killing the leader JobManager, all my standby JobManagers get killed too.

So instead of a standby JobManager taking over as the new leader, they all get killed, which isn't supposed to happen.

My setup: four servers; three of them run ZooKeeper, but only one server hosts all the JobManagers.

ad011.local: ZooKeeper + JobManagers
ad012.local: ZooKeeper + TaskManager
ad013.local: ZooKeeper
ad014.local: nothing interesting

My masters file looks like this:

ad011.local:8081
ad011.local:8082
ad011.local:8083
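
For context, in ZooKeeper HA mode start-cluster.sh reads this masters file and starts one JobManager per entry over SSH, roughly like the simplified sketch below (the /opt/flink paths are assumptions, and the exact jobmanager.sh arguments vary between Flink versions):

# One SSH call per masters entry, starting a dedicated JobManager
# that serves its WebUI on the listed port.
while IFS=':' read -r host webui_port; do
    ssh "$host" "/opt/flink/bin/jobmanager.sh start cluster $host $webui_port"
done < /opt/flink/conf/masters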

My flink-conf.yaml:

jobmanager.rpc.address: ad011.local

blob.server.port: 6130,6131,6132

jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data

metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791

high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181

high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
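
Once the cluster is up, the leader election znodes can be checked directly in ZooKeeper (a hedged check: /leaderlatch is Flink's default latch sub-path under path.root plus path.cluster-id, and the zkCli.sh location is an assumption):

# List the znodes Flink creates for leader election; one entry
# per participating JobManager is expected here.
/opt/zookeeper/bin/zkCli.sh -server ad011.local:2181 ls /flink/cluster-one/leaderlatch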

When I start Flink with the start-cluster.sh script I see my three JobManagers running, and the WebUI of each one points to ad011.local:8081, which is the leader. I assume that is expected?
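
That redirect can also be checked from a shell, assuming the standby web monitors answer with an HTTP redirect as the browser behaviour suggests:

# Ask a standby's WebUI where it sends us; a Location header pointing
# at ad011.local:8081 would confirm it defers to the current leader.
curl -sI http://ad011.local:8082/ | grep -i '^location'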

I then test the failover by killing the leader with kill, but then all my other standby JobManagers stop too.
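
For reference, this is how I kill only the leader (a sketch; the grep pattern matches the JobManager entrypoint class visible in the logs below, and 12345 stands in for the actual PID):

# Find the JobManager JVMs, then signal only the leader's PID
# (plain kill sends SIGTERM; kill -9 would skip shutdown hooks).
jps -l | grep org.apache.flink.runtime.jobmanager.JobManager
kill 12345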

This is what I see in my standby JobManager logs:

2017-09-29 08:08:41,590 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://[email protected]:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService@72d546c8.
2017-09-29 08:08:41,598 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://[email protected]:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://[email protected]:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://[email protected]:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2017-09-29 09:26:14,274 INFO  org.apache.flink.runtime.jobmanager.JobManager                - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:6132

The RECEIVED SIGNAL 15 line makes it look like the standby JobManagers are being terminated from outside rather than crashing on their own. Any help would be appreciated.


1 Answer

2 votes

Solved it by running my cluster with ./bin/start-cluster.sh directly instead of via service files (which call the same script); apparently the service file setup kills the other JobManagers.
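
A likely explanation for that behaviour (my assumption, not confirmed above): if the service file is a systemd unit, systemd's default KillMode=control-group sends SIGTERM to every process in the unit's cgroup on stop, which matches the SIGTERM lines in all the JobManager logs. A hedged sketch of a unit that manages a single JobManager instead, so stopping it cannot take the standbys down (paths and arguments are assumptions):

[Unit]
Description=Flink JobManager (hypothetical unit, one instance per JobManager)
After=network.target

[Service]
Type=forking
# jobmanager.sh starts and stops one JobManager rather than the whole
# cluster; the exact arguments differ between Flink versions.
ExecStart=/opt/flink/bin/jobmanager.sh start cluster
ExecStop=/opt/flink/bin/jobmanager.sh stop

[Install]
WantedBy=multi-user.target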