I have a Kafka environment with 2 brokers and 1 ZooKeeper node.
While I am producing messages to Kafka, if I stop broker 1 (which is the leader), the client stops producing messages and gives me the error below, even though broker 2 has been elected as the new leader for the topic and its partitions.
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
After 10 minutes, since broker 2 is the new leader, I expected the producer to send data to broker 2, but it kept failing with the exception above. lastRefreshMs and lastSuccessfullRefreshMs are still the same, even though metadataExpireMs is 300000 for the producer.
I am using the new Kafka producer implementation on the producer side.
It seems that when the producer is initialized, it binds to one broker, and if that broker goes down it does not even try to connect to the other brokers in the cluster.
My expectation is that if a broker goes down, the producer should immediately refresh its metadata, discover the other available brokers, and send data to them.
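To make that expectation concrete: as far as I know, the only explicit client-side way to force a metadata fetch is partitionsFor(), which blocks on a metadata update the same way send() does (up to the 60000 ms timeout in my config). A minimal sketch of what I mean; the topic name is a placeholder and producer is the KafkaProducer instance configured as shown further down:

import java.util.List;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.PartitionInfo;

// Sketch: force a metadata fetch and print the current leader of each partition.
// If no broker can be reached, this should fail with the same
// "Failed to update metadata" TimeoutException that send() gives me.
static void printLeaders(Producer<byte[], byte[]> producer) {
    List<PartitionInfo> partitions = producer.partitionsFor("my-topic"); // placeholder topic
    for (PartitionInfo p : partitions) {
        System.out.println("partition " + p.partition() + " -> leader " + p.leader());
    }
}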
By the way, my topic has 4 partitions and a replication factor of 2, in case that is relevant.
Configuration params:

request.timeout.ms=30000
retry.backoff.ms=100
buffer.memory=33554432
ssl.truststore.password=null
batch.size=16384
ssl.keymanager.algorithm=SunX509
receive.buffer.bytes=32768
ssl.cipher.suites=null
ssl.key.password=null
sasl.kerberos.ticket.renew.jitter=0.05
ssl.provider=null
sasl.kerberos.service.name=null
max.in.flight.requests.per.connection=5
sasl.kerberos.ticket.renew.window.factor=0.8
bootstrap.servers=[10.201.83.166:9500, 10.201.83.167:9500]
client.id=rest-interface
max.request.size=1048576
acks=1
linger.ms=0
sasl.kerberos.kinit.cmd=/usr/bin/kinit
ssl.enabled.protocols=[TLSv1.2, TLSv1.1, TLSv1]
metadata.fetch.timeout.ms=60000
ssl.endpoint.identification.algorithm=null
ssl.keystore.location=null
value.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location=null
ssl.keystore.password=null
key.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer
block.on.buffer.full=false
metrics.sample.window.ms=30000
metadata.max.age.ms=300000
security.protocol=PLAINTEXT
ssl.protocol=TLS
sasl.kerberos.min.time.before.relogin=60000
timeout.ms=30000
connections.max.idle.ms=540000
ssl.trustmanager.algorithm=PKIX
metric.reporters=[]
compression.type=none
ssl.truststore.type=JKS
max.block.ms=60000
retries=0
send.buffer.bytes=131072
partitioner.class=class org.apache.kafka.clients.producer.internals.DefaultPartitioner
reconnect.backoff.ms=50
metrics.num.samples=2
ssl.keystore.type=JKS
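For context, the producer is built from those settings roughly like this (a simplified sketch rather than the actual application code; the class name is illustrative and only the settings relevant to this question are repeated):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

// Simplified sketch of the producer construction; the real application passes
// the full configuration dumped above.
public class RestInterfaceProducerFactory {
    public static Producer<byte[], byte[]> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.201.83.166:9500,10.201.83.167:9500"); // both brokers listed
        props.put("client.id", "rest-interface");
        props.put("acks", "1");                     // wait for the leader's ack only
        props.put("retries", "0");                  // no client-side retries
        props.put("metadata.max.age.ms", "300000"); // metadata considered stale after 5 minutes
        props.put("max.block.ms", "60000");         // matches the 60000 ms in the exception
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        return new KafkaProducer<>(props);
    }
}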
Use Case (the "produce data" call used at each step is sketched after the list):
1- Start BR1 and BR2, produce data (leader is BR1)
2- Stop BR2, produce data (fine)
3- Stop BR1 (so there is no active broker in the cluster at this point), then start BR2 and produce data (fails, although the leader is BR2)
4- Start BR1, produce data (the leader is still BR2, but data is produced fine)
5- Stop BR2 (now BR1 is the leader)
6- Stop BR1 (BR1 is still the leader)
7- Start BR1, produce data (messages are produced fine again)
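The "produce data" step at each point is just a single synchronous send, so any problem surfaces immediately (simplified sketch; topic name and payload are placeholders):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Sketch of the "produce data" step: send one record and block on the result,
// so in step 3 this is where the "Failed to update metadata" TimeoutException shows up.
static void produceData(Producer<byte[], byte[]> producer) throws Exception {
    ProducerRecord<byte[], byte[]> record =
            new ProducerRecord<>("my-topic", "test".getBytes(StandardCharsets.UTF_8)); // placeholders
    RecordMetadata metadata = producer.send(record).get(); // blocks until the leader acks (acks=1)
    System.out.println("written to partition " + metadata.partition() + " at offset " + metadata.offset());
}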
If the producer sent its latest successful data to BR1 and then all brokers go down, the producer seems to wait for BR1 to come back up, even though BR2 is up and is the new leader. Is this expected behaviour?