2 votes

I am testing the resilience of Kafka (Apache kafka_2.12-1.1.0). What I expect is that the ISR of a topic should grow back on its own (i.e., replicate to an available node) whenever a node crashes. I spent 4 days googling for possible solutions, to no avail.

I have a 3-node cluster and created 3 brokers and 3 ZooKeepers on it (1 node = 1 broker + 1 ZooKeeper) using Docker (wurstmeister), and updated the following in server.properties:

offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
min.insync.replicas=2
default.replication.factor=3

I started all brokers, waited a minute, and created a topic with replication factor 3 and min in-sync replicas 2:

bin/kafka-topics.sh --create --zookeeper 172.31.31.142:2181,172.31.26.102:2181,172.31.17.252:2181  --config 'min.insync.replicas=2' --replication-factor 3 --partitions 1 --topic test2

When I describe the topic I see the following:

bash-4.4# bin/kafka-topics.sh --describe --zookeeper zookeeper:2181 --topic test2
Topic:test2     PartitionCount:1        ReplicationFactor:3     Configs:min.insync.replicas=2
        Topic: test2    Partition: 0    Leader: 2       Replicas: 2,3,1 Isr: 2,3,1

So far so good. Now I start consumers, followed by producers. When consumption is at full throttle I kill broker #2. Now when I describe the same topic I see the following ([Edit-1]):

bash-4.4# bin/kafka-topics.sh --describe --zookeeper zookeeper:2181 --topic test2
Topic:test2     PartitionCount:1        ReplicationFactor:3     Configs:min.insync.replicas=2
        Topic: test2    Partition: 0    Leader: 3       Replicas: 2,3,1 Isr: 3,1

bash-4.4# bin/kafka-topics.sh --describe --zookeeper zookeeper:2181 --topic __consumer_offsets
Topic:__consumer_offsets        PartitionCount:50       ReplicationFactor:3     Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
        Topic: __consumer_offsets       Partition: 0    Leader: 1       Replicas: 1,2,3 Isr: 1,3
        Topic: __consumer_offsets       Partition: 1    Leader: 3       Replicas: 2,3,1 Isr: 1,3
    .. .. ..

[end of edit-1]

I let the Kafka producer and consumer continue for a couple of minutes. Q1: why does Replicas still show 2 when broker 2 is down?

Now I added 2 more brokers to the cluster. While the producer and consumers continue, I keep observing the ISR; the number of ISR replicas doesn't increase, it sticks to 3,1 only. Q2: why is the ISR not increasing even though 2 more brokers are available?

Then I stopped the producer and consumer, waited a couple of minutes, and re-ran the describe command -- still the same result. Q3: when does the ISR expand its replication? When there are 2 more nodes available, why did the ISR not replicate?

I create my producer as follows:

props.put("acks", "all");
props.put("retries", 4);
props.put("batch.size", new Integer(args[2]));// 60384
props.put("linger.ms", new Integer(args[3]));// 1
props.put("buffer.memory", args[4]);// 33554432
props.put("bootstrap.servers", args[6]);// host:port,host:port,host:port etc
props.put("max.request.size", "10485760");// 1048576

and my consumer as follows:

props.put("group.id", "testgroup");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", args[2]);// 1000
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("max.partition.fetch.bytes", args[3]);// 52428800
props.put("fetch.max.bytes", args[4]);// 1048576
props.put("fetch.message.max.bytes", args[5]);// 1048576
props.put("bootstrap.servers", args[6]);
props.put("max.poll.records", args[7]);
props.put("max.poll.interval.ms", "30000");
props.put("auto.offset.reset", "latest");

In a separate experiment, when I removed another broker I started seeing errors that the total in-sync replicas are less than the minimum required. Surprisingly, in this state the producer is not blocked, but I see the error in the broker's server.log, and no new messages are getting enqueued. Q4: shouldn't the producer be blocked, instead of an error being thrown on the broker side? Or is my understanding wrong?

Any help please?


2 Answers

3 votes

If I understand correctly, Kafka does not auto-rebalance when you add brokers. A down replica will not be reassigned unless you use the partition reassignment tool (kafka-reassign-partitions.sh).
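A minimal sketch of that tool, assuming broker 2 is the dead one and one of the newly added brokers got id 4 (the topic name and the other broker ids are taken from the question; adjust for your cluster):

```shell
# reassign.json: move test2 partition 0 off dead broker 2 onto broker 4
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "test2", "partition": 0, "replicas": [4, 3, 1] }
  ]
}
EOF

# apply the new assignment (Kafka 1.1.0 still addresses ZooKeeper directly)
bin/kafka-reassign-partitions.sh --zookeeper zookeeper:2181 \
    --reassignment-json-file reassign.json --execute

# re-run with --verify until it reports the reassignment completed
bin/kafka-reassign-partitions.sh --zookeeper zookeeper:2181 \
    --reassignment-json-file reassign.json --verify
```

Once the reassignment completes and the new replica catches up, it joins the ISR; note that the first broker in the new replicas list also becomes the preferred leader.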

It's not clear what the differences are between your environments, but it looks like you didn't really kill a broker if it's still listed as a leader.

If you had two brokers down with a min ISR of 2, then yes, you'll see errors. The producer should still be able to reach at least one broker, though, so I don't think it'll be completely blocked unless you set acks to all. The errors on the broker end are more related to placing replicas.

3 votes

A recap of the meaning of "replica": all partition replicas are replicas, even the leader; in other words, 2 replicas means you have the leader and one follower.

When you describe the topic, for your only partition you see "Replicas: 2,3,1 Isr: 3,1", which means that when the topic was created the leader partition was assigned to broker 2 (the first in the replicas list) and followers were assigned to brokers 3 and 1; broker 2 is now the "preferred leader" for that partition.

This assignment is not going to change by itself (the leader may change, but not the "preferred leader"), so your followers will not move to other brokers; only the leader role can be given to another in-sync replica. (There is a property auto.leader.rebalance.enable which, when set to true, allows the leader role to go back to the preferred leader once it is up again; otherwise the leader role is kept by the newly elected leader.)

Next time, try to kill the leader broker and you will see that a new leader is elected and used, but "Replicas: 2,3,1" will stay.
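If you don't want to wait for auto.leader.rebalance.enable, Kafka 1.1.0 also ships a tool to hand leadership back to the preferred leader on demand; a sketch using the topic from the question (the JSON file is optional: without --path-to-json-file the election runs for all partitions):

```shell
# topics.json: restrict the preferred-leader election to test2 partition 0
cat > topics.json <<'EOF'
{ "partitions": [ { "topic": "test2", "partition": 0 } ] }
EOF

bin/kafka-preferred-replica-election.sh --zookeeper zookeeper:2181 \
    --path-to-json-file topics.json
```

This only moves the leader role back to the first broker in the replicas list (if it is alive and in sync); it does not change the replica assignment itself.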

And if you set replication-factor=3, acks=all, and min.insync.replicas=2, you can produce as long as 2 replicas acknowledge the writes (the leader and one follower), but you will get logs on the broker if it is not possible to maintain 3 ISRs...

Hope this helps...