One of our Kafka Streams Application's StreamThread consumers entered a zombie state after producing the following log message:
[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
It appears that the StreamThread's Kafka consumer has left the consumer group, but the Kafka Streams app remained in the RUNNING state while no longer consuming any new records.
I would like to detect that a Kafka Streams app has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies that the Kafka Streams app is in the RUNNING or REBALANCING state, but that check does not catch this case.
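For reference, the health check is essentially a thin wrapper around KafkaStreams#state(). A minimal sketch of the idea is below; the class and method names are placeholders, and the wiring to the actual Kubernetes probe endpoint is omitted:

```java
import org.apache.kafka.streams.KafkaStreams;

// Illustrative sketch only: report the instance as healthy
// while it is RUNNING or REBALANCING.
public class StreamsLivenessCheck {

    private final KafkaStreams kafkaStreams;

    public StreamsLivenessCheck(final KafkaStreams kafkaStreams) {
        this.kafkaStreams = kafkaStreams;
    }

    /** Backs the Kubernetes liveness probe; returning false triggers a pod restart. */
    public boolean isHealthy() {
        final KafkaStreams.State state = kafkaStreams.state();
        return state == KafkaStreams.State.RUNNING
                || state == KafkaStreams.State.REBALANCING;
    }
}
```

In the zombie scenario above this check keeps returning true, because the instance never leaves RUNNING even though its consumer has left the group.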
Therefore I have two questions:
- Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
- How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer? A rough sketch of the kind of check we have in mind follows below.
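For the second question, the direction we have been considering (unverified) is to inspect the embedded consumers' last-poll-seconds-ago metric, which KafkaStreams#metrics() exposes alongside the Streams metrics. In the sketch below, the class name, the threshold, and the fact that restore consumers are not treated separately are all placeholder assumptions:

```java
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

// Hypothetical sketch: flag the instance as a zombie when any embedded
// consumer reports that its last poll() happened too long ago.
public class ZombieCheck {

    // Placeholder threshold: roughly max.poll.interval.ms expressed in seconds.
    private static final double MAX_SECONDS_SINCE_LAST_POLL = 600.0;

    private final KafkaStreams kafkaStreams;

    public ZombieCheck(final KafkaStreams kafkaStreams) {
        this.kafkaStreams = kafkaStreams;
    }

    /** Returns true if any embedded consumer has not polled for longer than the threshold. */
    public boolean looksLikeZombie() {
        // KafkaStreams#metrics() also includes the embedded clients' metrics;
        // "last-poll-seconds-ago" is registered in the consumer's "consumer-metrics" group.
        for (final Map.Entry<MetricName, ? extends Metric> entry : kafkaStreams.metrics().entrySet()) {
            final MetricName name = entry.getKey();
            if ("consumer-metrics".equals(name.group())
                    && "last-poll-seconds-ago".equals(name.name())) {
                final Object value = entry.getValue().metricValue();
                if (value instanceof Number
                        && ((Number) value).doubleValue() > MAX_SECONDS_SINCE_LAST_POLL) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Is something along these lines a reasonable way to catch the zombie state, or is there a more idiomatic signal we should be using?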