The Kafka controller is brain of the Kafka cluster. It monitors the liveliness of the brokers and acts on broker failures.
There will be only one Kafka controller in the cluster. The controller is one of the Kafka brokers in the cluster, in addition to usual broker functionality, is also responsible for electing partition leaders whenever existing brokers leaves the cluster or when a broker joins the cluster.
The first broker that starts in the cluster will become the Kafka Controller by creating an ephemeral node called "/controller" in Zookeeper. When other brokers starts they also try to create this node in Zookeeper, but will receive an "node already exists" exception, by which they understands that there is already a Controller elected in the cluster.
When the Zookeeper doesn't receive heartbeat messages from the Controller, the ephemeral node in Zookeeper will get deleted. It then notifies all the other brokers in the cluster that the Controller is gone via Zookeeper watcher, which starts a new election for new Controller again. All the other brokers will again try to create a ephemeral node "/controller" and the first one to succeed will be elected as the new Controller.
There can be possibility of having more than one Controller in a cluster. Consider a case where a long GC (garbage collection) happened on the current Kafka Controller ("Controller_1") due to which Zookeeper didn't receive the heartbeat message from the Controller within the configured amount of time. This causes the "/controller" node being deleted from Zookeeper and another broker from the cluster gets elected as the new Controller ("Controller_2").
In this situation, we have 2 Controllers "Controller_1" and "Controller_2" in the cluster. "Controller_1" GC is finished and it may attempt to write/update the state in Zookeeper. "Controller_2" will also attempt to write/update the state in Zookeeper, which can lead to Kafka cluster being inconsistent with writes from both old Controller and new Controller.
In order to avoid it, a new "epoch" is generated every time a Controller election takes place.
Each time a controller is elected, it receives a new higher epoch through Zookeeper conditional increment operation.
With this, When an old Controller ("Controller_1") attempts to update something, Zookeeper compares the current epoch with the older epoch sent by the old Controller in its write/update request and it simply ignores it.
All the other brokers in the cluster also knows the current controller epoch and if they receive a message from old controller with an older epoch, they will ignore it as well.