2
votes

We run a Kafka cluster in Kubernetes based on the gcr.io/google_containers/kubernetes-kafka:1.0-10.2.1 Docker image, backed by ZooKeeper from gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10, with three instances each of Kafka and ZooKeeper.

We have a few different consumer groups that both consume and produce data on three different topics.

Behaviour: Sometimes a consumer group will set its offset for a partition of a topic to -1 and from then on stop consuming from that topic altogether. If we restart our consumers, we might see them set their offset to the latest offset, which might mean that the consumer has missed the messages that arrived between the offset going to -1 and the restart.

I'm having trouble finding out why a consumer group would ever set its offset to -1, and why it would do so "randomly" after days of uptime. Is there any logical explanation for why Kafka would set this offset for a certain consumer? We cannot see anything in our actual consumers that indicates they are explicitly doing this.

We currently have consumers written in both golang and Node.js, and all of them are facing this issue, so our current assumption is that the issue lies not with our consumers but with our Kafka setup.

1
+1, we're experiencing the exact same issue. Unfortunately I ended up purging the entire Kafka cluster and reinstalling it, in production, while on vacation. This is entirely a Kafka-on-Kubernetes issue that still eludes me. I also tried purging the topics and resetting the offsets, and failed miserably at both. All of our consumers and producers are written in Node.js. - Victor Palade
Can you check if your data is getting deleted because of retention policy? - Bitswazsky
Yes, that's not the issue :) - poppe

1 Answer

0
votes

The default offset retention period, offsets.retention.minutes, used to be 1 day (1440 minutes), and in older Kafka versions committed offsets were wiped once that period had passed since the last commit, even for active consumers. This was fixed by KIP-211.

We originally discovered this with Kafka 0.10.2.1: a few select topics lost their consumer group offsets (i.e., they turned to -1) because no messages arrived on those topics for a couple of days, so the offset retention policy kicked in and wiped out the offsets of still-active consumers.

We were able to work around this by increasing the retention setting to 7 days, which seems to be what Kafka also ended up adopting as its default; see KIP-186.
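For reference, the broker setting would look roughly like the fragment below. offsets.retention.minutes is the real broker property; where it goes is an assumption, since the Kubernetes image may pass broker settings as command-line overrides rather than through a server.properties file, so adapt it to however your deployment injects broker configuration.

```properties
# Broker configuration (e.g. server.properties), applied on every broker.
# 7 days = 7 * 24 * 60 = 10080 minutes
offsets.retention.minutes=10080
```

The brokers need to be restarted for the new value to take effect, and offsets that have already expired are not restored.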