
We have a 5-node Kafka cluster running Kafka 2.3.0 [ kafka_2.12-2.3.0 ] on OpenJDK 11.

Since last week, one of the brokers in the cluster has suddenly started going down.

We see the error below in the controller.log of this node and are not sure what it points to. Can someone shed some light on this, please? Thanks in advance.

2021-01-27 11:26:59,471 INFO kafka.controller.ZkPartitionStateMachine: [PartitionStateMachine controllerId=5] Stopped partition state machine
2021-01-27 11:26:59,472 INFO kafka.controller.ZkReplicaStateMachine: [ReplicaStateMachine controllerId=5] Stopped replica state machine
2021-01-27 11:26:59,472 INFO kafka.controller.KafkaController: [Controller id=5] Resigned
2021-01-27 11:27:55,955 ERROR kafka.controller.KafkaController: [Controller id=5] Error processing event RegisterBrokerAndReelect
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
    at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1725)
    at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
    at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
    at kafka.controller.KafkaController.processRegisterBrokerAndReelect(KafkaController.scala:1547)
    at kafka.controller.KafkaController.process(KafkaController.scala:1586)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:137)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:137)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89)
2021-01-27 11:27:55,977 INFO kafka.controller.ZkPartitionStateMachine: [PartitionStateMachine controllerId=5] Stopped partition state machine
2021-01-27 11:27:55,977 INFO kafka.controller.ZkReplicaStateMachine: [ReplicaStateMachine controllerId=5] Stopped replica state machine
2021-01-27 11:27:55,977 INFO kafka.controller.KafkaController: [Controller id=5] Resigned

1 Answer


From the error you shared, it looks like the connection between the controller/broker and the ZooKeeper node hit the session timeout.

A session timeout usually occurs when the broker cannot reach the ZooKeeper node within the configured timeout, for example because of a network problem or a long pause on the broker.

Please also be aware that in newer versions (Kafka 2.5.0 and later) the default for zookeeper.session.timeout.ms is 18 seconds, while in older versions it is only 6 seconds. That may be too low in some circumstances, so we recommend increasing it to 18 seconds.

That said, the broker should have tried to re-establish the connection automatically after the timeout error if connectivity was still working at the time. What I suspect happened is that another broker was elected as the new controller in the meantime.

When your broker (the old controller) then tried to reconnect to ZooKeeper as the controller, its request was not accepted. Rejecting the old controller is how a split-brain situation is avoided.

To avoid this, check your network and tune the zookeeper.session.timeout.ms setting.
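
As a rough sketch of that tuning, the broker's server.properties could set the session timeout explicitly to the 18-second value mentioned above. The file path shown in the comment is an assumption and depends on your installation. This is a static broker setting, so the broker has to be restarted for it to take effect.

```properties
# config/server.properties (path is an assumption; adjust to your installation)
# Raise the ZooKeeper session timeout from the old 6 s default to 18 s.
zookeeper.session.timeout.ms=18000
```

If you go much higher than this, keep in mind that the ZooKeeper server caps the negotiated session timeout (by default at 20 times its tickTime), so a larger value on the broker side may not be honored.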