One of our Kafka Streams Application's StreamThread consumers entered a zombie state after producing the following log message:
[Consumer clientId=notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer, groupId=notification-processor] Member notification-processor-db9aa8a3-6c3b-453b-b8c8-106bf2fa257d-StreamThread-1-consumer-b2b9eac3-c374-43e2-bbc3-d9ee514a3c16 sending LeaveGroup request to coordinator ****:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
It appears that the StreamThread's Kafka consumer has left the consumer group, but the Kafka Streams app remained in the RUNNING state while no longer consuming any new records.
I would like to detect that a Kafka Streams app has entered such a zombie state so it can be shut down and replaced with a new instance. Normally we do this via a Kubernetes health check that verifies that the Kafka Streams app is in the RUNNING or REBALANCING state, but that check does not catch this case.
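For reference, the health check is essentially a thin wrapper around KafkaStreams#state(). A minimal sketch of the idea is below; the class and method names are placeholders, and the wiring to the actual Kubernetes probe endpoint is omitted:

```java
import org.apache.kafka.streams.KafkaStreams;

// Illustrative sketch only: report the instance as healthy
// while it is RUNNING or REBALANCING.
public class StreamsLivenessCheck {

    private final KafkaStreams kafkaStreams;

    public StreamsLivenessCheck(final KafkaStreams kafkaStreams) {
        this.kafkaStreams = kafkaStreams;
    }

    /** Backs the Kubernetes liveness probe; returning false triggers a pod restart. */
    public boolean isHealthy() {
        final KafkaStreams.State state = kafkaStreams.state();
        return state == KafkaStreams.State.RUNNING
                || state == KafkaStreams.State.REBALANCING;
    }
}
```

In the zombie scenario above this check keeps returning true, because the instance never leaves RUNNING even though its consumer has left the group.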
Therefore I have two questions:
- Is it to be expected that the Kafka Streams app remains in a RUNNING state when it has no active consumers? If yes: why?
- How can we detect (programmatically / via metrics) that a Kafka Streams app has entered such a zombie state where it has no active consumer? A rough sketch of the kind of check we have in mind follows below.
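For the second question, the direction we have been considering (unverified) is to inspect the embedded consumers' last-poll-seconds-ago metric, which KafkaStreams#metrics() exposes alongside the Streams metrics. In the sketch below, the class name, the threshold, and the fact that restore consumers are not treated separately are all placeholder assumptions:

```java
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

// Hypothetical sketch: flag the instance as a zombie when any embedded
// consumer reports that its last poll() happened too long ago.
public class ZombieCheck {

    // Placeholder threshold: roughly max.poll.interval.ms expressed in seconds.
    private static final double MAX_SECONDS_SINCE_LAST_POLL = 600.0;

    private final KafkaStreams kafkaStreams;

    public ZombieCheck(final KafkaStreams kafkaStreams) {
        this.kafkaStreams = kafkaStreams;
    }

    /** Returns true if any embedded consumer has not polled for longer than the threshold. */
    public boolean looksLikeZombie() {
        // KafkaStreams#metrics() also includes the embedded clients' metrics;
        // "last-poll-seconds-ago" is registered in the consumer's "consumer-metrics" group.
        for (final Map.Entry<MetricName, ? extends Metric> entry : kafkaStreams.metrics().entrySet()) {
            final MetricName name = entry.getKey();
            if ("consumer-metrics".equals(name.group())
                    && "last-poll-seconds-ago".equals(name.name())) {
                final Object value = entry.getValue().metricValue();
                if (value instanceof Number
                        && ((Number) value).doubleValue() > MAX_SECONDS_SINCE_LAST_POLL) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Is something along these lines a reasonable way to catch the zombie state, or is there a more idiomatic signal we should be using?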