We have a five-node Kafka cluster running in production with a 3-node ZooKeeper ensemble; all are VMs. We have to restart the cluster frequently for hardware patching.
We have written an Ansible script to shut down the cluster in the following order:
- Stop Kafka Connect (nodes 1, 2, 3 sequentially) by killing the process
- Stop Kafka (nodes 1, 2, 3, 4, 5 sequentially) using kafka-server-stop.sh
- Stop ZooKeeper (nodes 1, 2, 3 sequentially) using zookeeper-server-stop.sh
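For reference, the shutdown order above can be sketched as an Ansible play like this (the host group names, install paths, and the pkill pattern are illustrative, not verbatim from our playbook):

```yaml
# Illustrative stop play: serial: 1 stops one node at a time, in inventory order.
- hosts: connect_nodes
  serial: 1
  tasks:
    - name: Stop Kafka Connect by killing the process
      command: pkill -f ConnectDistributed    # match pattern is an assumption

- hosts: kafka_nodes
  serial: 1
  tasks:
    - name: Stop Kafka broker
      command: /opt/kafka/bin/kafka-server-stop.sh    # path is illustrative

- hosts: zookeeper_nodes
  serial: 1
  tasks:
    - name: Stop ZooKeeper
      command: /opt/kafka/bin/zookeeper-server-stop.sh
```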
After patching, the start script does the following:
- Start ZooKeeper (nodes 1, 2, 3 sequentially) using zookeeper-server-start.sh
- Start Kafka (nodes 1, 2, 3, 4, 5 sequentially) using kafka-server-start.sh
- Start Kafka Connect (nodes 1, 2, 3 sequentially) using connect-distributed.sh
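Similarly, the start sequence, including the fixed delay described below, looks roughly like this (again, group names, paths, and config file locations are illustrative):

```yaml
- hosts: zookeeper_nodes
  serial: 1
  tasks:
    - name: Start ZooKeeper
      command: /opt/kafka/bin/zookeeper-server-start.sh -daemon /opt/kafka/config/zookeeper.properties

- hosts: kafka_nodes
  serial: 1
  tasks:
    - name: Start Kafka broker
      command: /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties

# The hard-coded delay we want to replace with a real readiness check.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Fixed wait before starting Connect
      pause:
        minutes: 10

- hosts: connect_nodes
  serial: 1
  tasks:
    - name: Start Kafka Connect in distributed mode
      command: /opt/kafka/bin/connect-distributed.sh -daemon /opt/kafka/config/connect-distributed.properties
```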
The issue is with step #3 of the start script. We have a hard-coded delay of about 10 minutes before starting Kafka Connect, to make sure the Kafka cluster is fully up and running. But sometimes some of the nodes take longer than that to start, so the Kafka Connect startup fails even after the delay. When that happens we have to wait another 30 minutes and restart Connect manually.
Is there any way to make sure that all nodes in the cluster are up and running before I start the other processes?
Thanks in advance.
A hard-coded delay does not work; we can't keep adjusting it based on guesswork.
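One direction we are considering (sketch only; the broker port, ZooKeeper address, install path, and expected broker count below are assumptions): first wait for each broker's listener port to open, then poll ZooKeeper until all five broker ids appear under /brokers/ids, instead of pausing for a fixed time. Something like this before the Connect play:

```yaml
# Run the readiness checks from the first broker host before starting Connect.
- hosts: kafka_nodes[0]
  gather_facts: false
  tasks:
    # 1) TCP-level check: each broker's listener port accepts connections.
    - name: Wait for every Kafka broker port to open
      wait_for:
        host: "{{ item }}"
        port: 9092                  # broker listener port -- assumption
        timeout: 600                # give each broker up to 10 minutes
      loop: "{{ groups['kafka_nodes'] }}"

    # 2) Cluster-level check: all 5 broker ids registered in ZooKeeper.
    #    The last line of output looks like "[0, 1, 2, 3, 4]".
    - name: Wait until all 5 broker ids are registered in ZooKeeper
      shell: /opt/kafka/bin/zookeeper-shell.sh zk1:2181 ls /brokers/ids | tail -1
      register: broker_ids
      until: broker_ids.stdout.split(',') | length == 5
      retries: 60                   # 60 x 10s = up to 10 more minutes
      delay: 10
      changed_when: false
```

A port being open only proves the broker process is listening, which is why the second, ZooKeeper-based check seems necessary; we would appreciate confirmation that counting ids under /brokers/ids is a reliable signal that the cluster is ready for Connect.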