We currently have around 80 applications (roughly 200 K8s replicas) writing 16-17 million records per day to Kafka, and some of those records fail intermittently with timeout and rebalance exceptions. The failure rate is less than 0.02%.
We have validated and configured all the parameters as suggested in other Stack Overflow answers, yet we are still running into multiple issues.
The first issue is rebalancing; we are hitting it on both the producer and consumer side. On the consumer side we use auto commit, and whenever Kafka rebalances, the consumer receives duplicate records. We did not add a duplicate check because it would reduce the processing rate, and the duplicate rate is below 0.1%. We are considering switching to manual commit with offset management in a database, but we first need to understand, from the brokers' perspective, why rebalancing is happening on a daily basis.
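For reference, the manual-commit variant we are considering looks roughly like the sketch below; the bootstrap servers, group id, topic name, and the process() method are placeholders rather than our real values:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "xx-consumer-group");     // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");       // switch off auto commit
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("xxTopic"));       // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                                        // business logic / offset bookkeeping in the DB
                }
                // Commit only after the polled batch is fully processed, so a
                // rebalance can at worst replay the current batch instead of losing records.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for our processing and database-side offset tracking
    }
}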
Producer Error:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
The second issue is a TimeoutException. It happens intermittently for some of the apps: the producer tries to send a record, the record is added to a batch, but it is not delivered before the request timeout, which we have already increased to 5 minutes. Ideally Kafka should retry at some interval. While debugging, we found that the record accumulator expires the pending batches without even trying to send them when the request times out - is that the expected behavior, and can we add a retry for this case?
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for xxTopic-41:300014 ms has passed since batch creation.
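For context, our understanding of the settings that control batch expiry is sketched below. delivery.timeout.ms (available since Kafka clients 2.1) caps how long a record may wait in the accumulator plus in flight, and must be at least request.timeout.ms + linger.ms; the values here are illustrative, not a recommendation:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTimeoutSketch {
    public static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.RETRIES_CONFIG, 5);                        // retries only cover requests that were actually sent
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10000);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 300000);        // per-request timeout (our current 5 minutes)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 8192);
        // delivery.timeout.ms bounds the total time a record can spend waiting in the
        // accumulator plus in flight; if it is smaller than request.timeout.ms + linger.ms,
        // batches can expire in the accumulator before any send or retry happens.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 320000);
        return new KafkaProducer<>(props);
    }
}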
Configuration:
1. 5 brokers and 3 ZooKeeper nodes - Kafka version 2.2
2. Brokers are running in Kubernetes as a StatefulSet.
3. Each broker has 32 GB of memory and 8 CPUs, as recommended by Confluent for production.
4. The topic has 200 partitions and 8 consumer replicas.
5. Each consumer handles only around 25-30 threads. Each consumer has 4 GB of memory and 4 CPUs.
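On the rebalance question above, our understanding is that the consumer group settings below decide when the broker evicts a member and triggers a rebalance: if a poll loop takes longer than max.poll.interval.ms, or heartbeats stop for session.timeout.ms, the group rebalances. The values are illustrative, not our production config:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RebalanceTuningSketch {
    // Illustrative values only; Kafka 2.2 defaults are max.poll.records=500,
    // max.poll.interval.ms=300000, session.timeout.ms=10000, heartbeat.interval.ms=3000.
    public static Properties groupSettings() {
        Properties props = new Properties();
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);        // smaller batches finish within max.poll.interval.ms
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // max allowed gap between poll() calls
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);    // heartbeat window before the broker evicts the consumer
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000); // typically about one third of session.timeout.ms
        return props;
    }
}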
@Value("${request.timeout:300000}") <br/>
private String requestTimeOut;
@Value("${batch.size:8192}") <br/>
private String batchSize;
@Value("${retries:5}") <br/>
private Integer kafkaRetries;
@Value("${retry.backoff.ms:10000}") <br/>
private Integer kafkaRetryBackoffMs;
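These fields feed the producer roughly as follows (a simplified paraphrase of our Spring wiring, not copied verbatim; the bootstrap servers are a placeholder and DefaultKafkaProducerFactory comes from spring-kafka):

// Lives in the same Spring @Configuration class as the @Value fields above.
@Bean
public ProducerFactory<String, String> producerFactory() {
    Map<String, Object> config = new HashMap<>();
    config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // placeholder
    config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, requestTimeOut);    // 300000 by default above
    config.put(ProducerConfig.BATCH_SIZE_CONFIG, batchSize);                 // 8192
    config.put(ProducerConfig.RETRIES_CONFIG, kafkaRetries);                 // 5
    config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, kafkaRetryBackoffMs); // 10000
    return new DefaultKafkaProducerFactory<>(config);
}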
We are from the development team and don't have much insight into the networking side, so we need help figuring out whether this is related to network congestion or whether we need to improve something in the application itself. We did not face any issues when the load was under 10 million records per day; with a lot of new apps sending messages and the increased load, we are now seeing the two issues above intermittently.