
We have a Kafka cluster with 4 brokers. We have set up the topic with replication.factor=3 and min.insync.replicas=2.

We noticed that whenever a single broker fails, our producers start failing within 60-90 seconds with the error below:

org.apache.kafka.common.errors.TimeoutException: Batch containing 19 record(s) expired due to timeout while requesting metadata from brokers for a-13
[ERROR] ERROR Parser:567 - org.apache.kafka.common.errors.TimeoutException: Batch containing 19 record(s) expired due to timeout while requesting metadata from brokers for a-13

We have the following configs on the producer side:

acks=all
request.timeout.ms=120000
retry.backoff.ms=5000
retries=3
linger.ms=250
max.in.flight.requests.per.connection=2

Per this configuration, shouldn't the producer take at least 6 minutes before failing, since request.timeout.ms is 2 minutes and retries=3?
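For what it's worth, the back-of-the-envelope math behind that expectation can be sketched as below. This assumes request.timeout.ms applies once per send attempt with retry.backoff.ms between attempts; note that a batch that expires "while requesting metadata" (as in the error above) may never reach the send path at all, in which case retries never kick in.

```python
# Rough upper bound on time-to-failure, ASSUMING request.timeout.ms is
# applied per send attempt and retry.backoff.ms is waited between attempts.
request_timeout_ms = 120_000
retry_backoff_ms = 5_000
retries = 3

attempts = retries + 1  # the original send plus 3 retries
worst_case_ms = attempts * request_timeout_ms + retries * retry_backoff_ms
print(worst_case_ms / 60_000, "minutes")  # 8.25 minutes
```

So under that assumption the worst case would be even longer than 6 minutes, which makes the near-instant failures all the more suspicious.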

We do not have unclean leader election enabled. We are running Kafka 2.0 and the producer client version is 0.10.0.1.

We have replica.lag.time.max.ms set to 10s on the brokers. When the issue happened, we noticed that leader re-election completed within 40 seconds, so I am confused why the producers fail almost instantly when one broker goes down.

I can provide more info if required.

You set acks=all, which requires all brokers to be up. And why is your producer so low? – OneCricketeer
What do you mean by low? – user3679686
The producer version. You are running Kafka 2.0, so your client version should be at least 0.11, but upgrading all the way would get more benefits. – OneCricketeer
Yes, the client version is very old, but that is the system we inherited. I want to be sure whether version incompatibility might be causing this issue. – user3679686
It's possible... You would have to read the Kafka release notes. Then again, you're having timeout errors that are just related to temporal effects, so increasing at least linger.ms would be one possibility, or reducing the batch/buffer sizes. – OneCricketeer
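One way to express that suggestion as producer properties. The values below are illustrative placeholders, not tested recommendations; the defaults shown in the comments are the standard producer defaults.

```properties
# Batch longer before sending (up from the 250 used here)...
linger.ms=500
# ...or reduce the batch size (default is 16384)...
batch.size=8192
# ...or shrink the accumulator buffer (default is 33554432).
buffer.memory=16777216
```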

1 Answer


You set acks=all, but did not mention which broker went down.

Sounds like the failed broker hosted one of the topic's partitions, and the ack is failing.
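One way to check that is to describe the topic and see which broker leads each partition. The hosts and topic name below are placeholders; note that on Kafka 2.0 kafka-topics.sh still takes --zookeeper (the --bootstrap-server option arrived later, in 2.2).

```shell
# Show the leader, replicas, and in-sync replicas (ISR) for each partition.
# zk-host and your-topic are placeholders for your environment.
bin/kafka-topics.sh --zookeeper zk-host:2181 --describe --topic your-topic
```

If the failed broker appears as the leader of one or more partitions and drops out of the ISR, writes with acks=all to those partitions will stall until a new leader is elected.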