I've been having a recurring issue with a Kafka cluster running on AWS EC2 instances.

Description

  • Kafka cluster version 0.10.1.0
  • 3-broker cluster
  • topics have 6 partitions per broker
  • Instance type is m4.xlarge

Symptoms

The following happens at random intervals, on random brokers.

Here is what I could gather from the logs:

  1. Intra-cluster replication shrinks on a random broker (I suppose it could be a temporary network failure, but I couldn't produce evidence of it).

  2. The system starts showing close to no activity at 02:27:20 (note that it's not load-related, as it happens at very quiet times).

  3. From there, this Kafka broker doesn't process messages, which I think is expected since it dropped out of the cluster replication.

  4. Now the real issue appears: the number of connections in CLOSE_WAIT keeps increasing until it reaches the configured ulimit of the system/process, which ends up crashing the Kafka process.
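For anyone wanting to track the CLOSE_WAIT build-up described in the last step, here is a minimal Python sketch that counts sockets per TCP state by parsing the Linux /proc/net/tcp table. It is only an illustration, not part of the original setup: the function names are made up for this example, and it assumes the broker listens on Kafka's default port 9092.

```python
from collections import Counter
from pathlib import Path

# Linux kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text, local_port=None):
    """Count sockets per TCP state from the text of /proc/net/tcp.

    If local_port is given, only sockets bound to that local port are counted.
    """
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state_code = fields[1], fields[3]
        # The address column is "<hex ip>:<hex port>".
        port = int(local_address.rsplit(":", 1)[1], 16)
        if local_port is not None and port != local_port:
            continue
        counts[TCP_STATES.get(state_code.upper(), state_code)] += 1
    return counts


if __name__ == "__main__":
    KAFKA_PORT = 9092  # assumption: default Kafka listener port
    total = Counter()
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(table)
        if path.exists():
            total += count_tcp_states(path.read_text(), local_port=KAFKA_PORT)
    print("CLOSE_WAIT sockets on port", KAFKA_PORT, ":", total["CLOSE_WAIT"])
```

Running this periodically and logging the result would show whether CLOSE_WAIT grows steadily after the broker drops out of replication.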

I've tried raising the limits to see if Kafka would eventually rejoin the ISR before crashing, but even with a very high limit, Kafka just seems stuck in a weird state and never recovers.

Note that between the time the faulty broker is on its own and the time it crashes, Kafka is still listening and Kafka producers keep trying to send to it.

For this single crash, I could see 320 errors like this from the producers:

java.util.concurrent.ExecutionException: org.springframework.kafka.core.KafkaProducerException: Failed to send; nested exception is org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

The configuration being the default one and the use being quite standard, I'm wondering if I missed something.

I put in place a script that checks the number of Kafka file descriptors and restarts the service when it gets abnormally high. That does the trick for now, but I still lose messages when it crashes.
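A watchdog along those lines could look roughly like the Python sketch below. This is not the actual script, only an illustration of the idea: the descriptor limit, the 80% threshold, the polling interval, and the `systemctl restart kafka` command (including the service name) are all assumptions to adapt.

```python
import os
import subprocess
import time

# All values below are illustrative assumptions, not taken from the real setup.
FD_LIMIT = 65536        # hypothetical `ulimit -n` for the Kafka process
RESTART_RATIO = 0.8     # restart once 80% of the limit is in use
RESTART_CMD = ["systemctl", "restart", "kafka"]  # hypothetical service name


def count_open_fds(pid):
    """Number of file descriptors currently open by `pid` (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def should_restart(fd_count, fd_limit=FD_LIMIT, ratio=RESTART_RATIO):
    """True once the descriptor count reaches the chosen fraction of the limit."""
    return fd_count >= fd_limit * ratio


def watch(pid, interval_seconds=30):
    """Poll the process and restart the service when descriptors run high.

    Returns after triggering a restart, because the broker gets a new PID.
    """
    while True:
        if should_restart(count_open_fds(pid)):
            subprocess.run(RESTART_CMD, check=True)
            return
        time.sleep(interval_seconds)
```

Calling `watch(pid)` with the broker's PID polls until the threshold is crossed. Restarting before the hard limit is hit avoids the crash itself, but it does not address the underlying stuck state.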

Any help to get to the bottom of this would be appreciated.

1 Answer

Turns out there was a deadlock in the version I was using.

Upgrading fixed the issue.

See the ticket about the issue:

https://issues.apache.org/jira/browse/KAFKA-5721