I have a Kafka environment with 2 brokers and 1 ZooKeeper node.
While I am producing messages to Kafka, if I stop broker 1 (which is the leader), the client stops producing messages and gives me the error below, even though broker 2 has been elected as the new leader for the topic and its partitions.
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
After 10 minutes, since broker 2 is the new leader, I expected the producer to send data to broker 2, but it kept failing with the exception above. lastRefreshMs and lastSuccessfullRefreshMs are still the same, even though metadataExpireMs is 300000 for the producer.
I am using the new Kafka producer implementation on the producer side.
It seems that when the producer is initialized, it binds to one broker, and if that broker goes down it does not even try to connect to the other brokers in the cluster.
My expectation is that if a broker goes down, the producer should immediately refresh its metadata, discover the other available brokers, and send data to them.
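To make that expectation concrete: as far as I know, the only explicit client-side way to force a metadata fetch is partitionsFor(), which blocks on a metadata update the same way send() does (up to the 60000 ms timeout in my config). A minimal sketch of what I mean; the topic name is a placeholder and producer is the KafkaProducer instance configured as shown further down:

import java.util.List;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.PartitionInfo;

// Sketch: force a metadata fetch and print the current leader of each partition.
// If no broker can be reached, this should fail with the same
// "Failed to update metadata" TimeoutException that send() gives me.
static void printLeaders(Producer<byte[], byte[]> producer) {
    List<PartitionInfo> partitions = producer.partitionsFor("my-topic"); // placeholder topic
    for (PartitionInfo p : partitions) {
        System.out.println("partition " + p.partition() + " -> leader " + p.leader());
    }
}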
By the way, my topic has 4 partitions and a replication factor of 2, in case that is relevant.
Configuration params:

request.timeout.ms=30000
retry.backoff.ms=100
buffer.memory=33554432
ssl.truststore.password=null
batch.size=16384
ssl.keymanager.algorithm=SunX509
receive.buffer.bytes=32768
ssl.cipher.suites=null
ssl.key.password=null
sasl.kerberos.ticket.renew.jitter=0.05
ssl.provider=null
sasl.kerberos.service.name=null
max.in.flight.requests.per.connection=5
sasl.kerberos.ticket.renew.window.factor=0.8
bootstrap.servers=[10.201.83.166:9500, 10.201.83.167:9500]
client.id=rest-interface
max.request.size=1048576
acks=1
linger.ms=0
sasl.kerberos.kinit.cmd=/usr/bin/kinit
ssl.enabled.protocols=[TLSv1.2, TLSv1.1, TLSv1]
metadata.fetch.timeout.ms=60000
ssl.endpoint.identification.algorithm=null
ssl.keystore.location=null
value.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location=null
ssl.keystore.password=null
key.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer
block.on.buffer.full=false
metrics.sample.window.ms=30000
metadata.max.age.ms=300000
security.protocol=PLAINTEXT
ssl.protocol=TLS
sasl.kerberos.min.time.before.relogin=60000
timeout.ms=30000
connections.max.idle.ms=540000
ssl.trustmanager.algorithm=PKIX
metric.reporters=[]
compression.type=none
ssl.truststore.type=JKS
max.block.ms=60000
retries=0
send.buffer.bytes=131072
partitioner.class=class org.apache.kafka.clients.producer.internals.DefaultPartitioner
reconnect.backoff.ms=50
metrics.num.samples=2
ssl.keystore.type=JKS
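For context, the producer is built from those settings roughly like this (a simplified sketch rather than the actual application code; the class name is illustrative and only the settings relevant to this question are repeated):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

// Simplified sketch of the producer construction; the real application passes
// the full configuration dumped above.
public class RestInterfaceProducerFactory {
    public static Producer<byte[], byte[]> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.201.83.166:9500,10.201.83.167:9500"); // both brokers listed
        props.put("client.id", "rest-interface");
        props.put("acks", "1");                     // wait for the leader's ack only
        props.put("retries", "0");                  // no client-side retries
        props.put("metadata.max.age.ms", "300000"); // metadata considered stale after 5 minutes
        props.put("max.block.ms", "60000");         // matches the 60000 ms in the exception
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        return new KafkaProducer<>(props);
    }
}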
Use Case (the "produce data" call used at each step is sketched after the list):
1- Start BR1 and BR2, produce data (leader is BR1)
2- Stop BR2, produce data (fine)
3- Stop BR1 (so there is no active broker in the cluster at this point), then start BR2 and produce data (fails, although the leader is BR2)
4- Start BR1, produce data (the leader is still BR2, but data is produced fine)
5- Stop BR2 (now BR1 is the leader)
6- Stop BR1 (BR1 is still the leader)
7- Start BR1, produce data (messages are produced fine again)
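The "produce data" step at each point is just a single synchronous send, so any problem surfaces immediately (simplified sketch; topic name and payload are placeholders):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Sketch of the "produce data" step: send one record and block on the result,
// so in step 3 this is where the "Failed to update metadata" TimeoutException shows up.
static void produceData(Producer<byte[], byte[]> producer) throws Exception {
    ProducerRecord<byte[], byte[]> record =
            new ProducerRecord<>("my-topic", "test".getBytes(StandardCharsets.UTF_8)); // placeholders
    RecordMetadata metadata = producer.send(record).get(); // blocks until the leader acks (acks=1)
    System.out.println("written to partition " + metadata.partition() + " at offset " + metadata.offset());
}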
If the producer sent its latest successful data to BR1 and then all brokers go down, the producer seems to wait for BR1 to come back up, even though BR2 is up and is the new leader. Is this expected behaviour?