
We have three Kafka brokers and a topic with 40 partitions and a replication factor of 1. After an uncontrolled broker shutdown, we see that for some partitions it was not possible to elect a new leader (see logs below), and as a result we cannot read from the topic. Please advise whether it is possible to survive this kind of crash without raising the replication factor above 1.

Because we want the target database (built from the events in the Kafka topic) to stay in a consistent state, we have also set unclean.leader.election.enable to false.
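For context, this is roughly what our setup looks like (the topic name matches the logs below; the ZooKeeper address and the exact creation command are illustrative, not our real invocation):

```shell
# Broker setting (server.properties): never promote an out-of-sync
# replica to leader, even if that leaves the partition offline.
#   unclean.leader.election.enable=false

# The topic as described above: 40 partitions, replication factor 1.
kafka-topics.sh --create --zookeeper zk:2181 \
  --topic extenr-topic --partitions 40 --replication-factor 1
```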

Partition info after the crash:

extenr-topic:1:882091242
extenr-topic:19:882091615
extenr-topic:28:882092273
Error: partition 18 does not have a leader. Skip getting offsets
Error: partition 27 does not have a leader. Skip getting offsets
Error: partition 36 does not have a leader. Skip getting offsets

Exception from the Kafka broker:

2017-10-09 05:56:50,302 ERROR state.change.logger: Controller 236 epoch 267 initiated state change for partition [extenr-topic,15] from OfflinePartition to OnlinePartition failed
kafka.common.NoReplicaOnlineException: No broker in ISR for partition [extenr-topic,15] is alive. Live brokers are: [Set(236, 237)], ISR brokers are: [235]
at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:66)
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:342)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:203)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:118)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:115)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)

There are also the following errors in the logs:

2017-10-09 04:11:25,509 ERROR state.change.logger: Broker 235 received LeaderAndIsrRequest with correlation id 1 from controller 236 epoch 267 for partition [extenr-topic,36] but cannot become follower since the new leader -1 is unavailable.

1 Answer


Partitions with a replication factor of 1 go offline when their leader crashes or shuts down, because there are no other replicas available to take over leadership.
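You can confirm which partitions are currently leaderless with the topic tool's `--unavailable-partitions` filter; a sketch for your ZooKeeper-based deployment (the `zk:2181` address is a placeholder):

```shell
# Lists only partitions that currently have no leader (Leader: -1).
kafka-topics.sh --describe --zookeeper zk:2181 \
  --topic extenr-topic --unavailable-partitions
```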

If availability is important to you, I suggest increasing the replication factor. The recommended configuration [1] for high availability is replication.factor set to 3 and min.insync.replicas set to 2.
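You can raise the replication factor of the existing topic with a partition reassignment. A sketch, with the broker ids 235–237 taken from your logs and the partition list abbreviated to a single entry (in practice you would list all 40 partitions):

```shell
# increase-rf.json: assign three replicas to each partition.
cat > increase-rf.json <<'EOF'
{"version": 1,
 "partitions": [
   {"topic": "extenr-topic", "partition": 0, "replicas": [235, 236, 237]}
 ]}
EOF

# Apply the reassignment; Kafka copies the data to the new replicas
# in the background and expands the ISR once they catch up.
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file increase-rf.json --execute
```

After the reassignment completes, also set min.insync.replicas=2 on the topic so producers using acks=all get the durability guarantee the answer describes.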

1: http://kafka.apache.org/documentation/#brokerconfigs