We are seeing unexpected rebalances in Java Kafka consumers, described below. Do these problems sound familiar to anybody? Any tips on APIs or debug techniques to figure out rebalance causes?
Two processes are reading a topic. Sometimes all partitions on the topic get rebalanced to a single reader process. After restarting both processes, partitions get evenly balanced.
Two processes are reading a topic. Sometimes a long sequence of rebalances bounces partitions from reader to reader. We call pause/resume on consumers for backpressure, which should prevent this.
Two processes are reading a topic. Sometimes a rebalance happens when it looks like both processes are reading ok. Afterwards, reading works ok, but it's a hiccup in processing.
We would expect partitions not to rebalance unless there is also some visible cause or failure.
Sometimes poll() gets stuck (exceeds the timeout) and we use wakeup() and close(), then create new consumers. Sometimes coordinator heartbeat threads keep running after consumers are closed (we've seen thousands). The timing seems unrelated to rebalances, so rebalances seem like a separate problem, but maybe heartbeats are hitting an unlogged network problem.
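One way to quantify the leftover heartbeat threads from inside the JVM is to count live threads by name. The following is a minimal sketch, not something taken from our code: the class name `HeartbeatThreadCounter` is hypothetical, and the thread-name prefix is an assumption that should be checked against a thread dump from the client version in use.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts live JVM threads whose names start with a prefix. */
class HeartbeatThreadCounter {

    // Assumption: the Java client names its heartbeat threads with this prefix,
    // followed by " | <group.id>". Confirm with a thread dump before relying on it.
    static final String HEARTBEAT_PREFIX = "kafka-coordinator-heartbeat-thread";

    /** Returns thread name -> number of live threads with that name. */
    static Map<String, Integer> countByName(String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(prefix)) {
                counts.merge(t.getName(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the total number of live threads whose names start with the prefix. */
    static int total(String prefix) {
        int sum = 0;
        for (int n : countByName(prefix).values()) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        // In a consumer process this could be logged periodically and after each close().
        System.out.println("heartbeat threads: " + total(HEARTBEAT_PREFIX));
        System.out.println(countByName(HEARTBEAT_PREFIX));
    }
}
```

Logging this count right after each close() and comparing it with the number of consumers that are supposed to be open would show whether the thread count grows with each consumer that is replaced.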
We use a ConsumerRebalanceListener to log and process certain rebalances, but Kafka APIs don't seem to expose data about the cause of rebalances.
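Even without a cause, the listener callbacks can record how long each rebalance paused a consumer and which partitions moved. Here is a rough sketch of that bookkeeping, assuming the eager protocol where onPartitionsRevoked receives everything the consumer held and onPartitionsAssigned receives the full new assignment. `RebalanceTracker` is a hypothetical name, and plain partition numbers stand in for TopicPartition so the example runs without the Kafka client on the classpath.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: summarizes each rebalance as seen by one consumer. */
class RebalanceTracker {
    private Set<Integer> owned = new TreeSet<>(); // partitions held before the rebalance
    private long revokedAtMs = -1;                // -1 until the first revoke callback
    private int rebalanceCount = 0;

    /** Call from onPartitionsRevoked with the partitions being given up. */
    synchronized void revoked(Collection<Integer> partitions, long nowMs) {
        owned = new TreeSet<>(partitions);
        revokedAtMs = nowMs;
    }

    /** Call from onPartitionsAssigned; returns a one-line summary for the log. */
    synchronized String assigned(Collection<Integer> partitions, long nowMs) {
        Set<Integer> now = new TreeSet<>(partitions);
        Set<Integer> gained = new TreeSet<>(now);
        gained.removeAll(owned);                  // new partitions this consumer did not hold
        Set<Integer> lost = new TreeSet<>(owned);
        lost.removeAll(now);                      // partitions that moved elsewhere
        long pauseMs = revokedAtMs < 0 ? 0 : nowMs - revokedAtMs;
        owned = now;
        rebalanceCount++;
        return "rebalance #" + rebalanceCount + " pauseMs=" + pauseMs
                + " gained=" + gained + " lost=" + lost + " now=" + now;
    }

    public static void main(String[] args) {
        RebalanceTracker tracker = new RebalanceTracker();
        tracker.revoked(Arrays.asList(0, 1, 2), 1000L);
        System.out.println(tracker.assigned(Arrays.asList(1, 2, 3), 1250L));
    }
}
```

Comparing these summaries across both reader processes by timestamp would at least show which consumer saw the rebalance first and whether the assignment actually changed.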
The rebalances are intermittent and hard to reproduce. They happened at a message rate anywhere from 10,000 to 80,000 per second. We see no obvious errors in the logs.
Our read loop is trivial - basically "while running, poll with timeout and error handling, then enqueue received messages".
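Since max.poll.interval.ms bounds the delay between successive poll() calls, one thing that could be measured in such a loop is the longest gap actually observed, for example when the enqueue step blocks. A minimal sketch follows, with the hypothetical class `PollGapMonitor` and timestamps passed in by the caller so it can be exercised without a consumer.

```java
/** Hypothetical helper: tracks the longest gap between successive poll() calls. */
class PollGapMonitor {
    private final long limitMs;   // the value to compare against, e.g. max.poll.interval.ms
    private long lastPollMs = -1; // -1 until the first poll is recorded
    private long maxGapMs = 0;
    private int violations = 0;

    PollGapMonitor(long limitMs) {
        this.limitMs = limitMs;
    }

    /** Call immediately before each poll(); returns the gap since the previous call. */
    long beforePoll(long nowMs) {
        long gap = lastPollMs < 0 ? 0 : nowMs - lastPollMs;
        lastPollMs = nowMs;
        if (gap > maxGapMs) {
            maxGapMs = gap;
        }
        if (gap > limitMs) {
            violations++;         // this gap would have exceeded the configured limit
        }
        return gap;
    }

    long maxGapMs() {
        return maxGapMs;
    }

    int violations() {
        return violations;
    }

    public static void main(String[] args) {
        PollGapMonitor monitor = new PollGapMonitor(50000);
        // In a real loop the argument would be System.currentTimeMillis().
        monitor.beforePoll(0);
        monitor.beforePoll(1200);
        monitor.beforePoll(61200);
        System.out.println("maxGapMs=" + monitor.maxGapMs()
                + " violations=" + monitor.violations());
    }
}
```

One monitor per consumer thread is assumed here; the class is not thread-safe because each consumer is polled from a single thread.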
People have asked good related questions, but the answers didn't help us:
- Conditions in which Kafka Consumer (Group) triggers a rebalance
- What exactly IS Kafka Rebalancing?
- Continuous consumer group rebalancing with more consumers than partitions
Configuration:
- Kafka 0.10.1.0 (We've started trying 1.0.0 & don't have test results yet)
- Java 8 brokers and clients
- 2 brokers, 1 zookeeper, stable running processes & no additions
- 5 topics, with 2 somewhat busy topics. The rebalances happen on a busy one (topic "A").
- Topic A has 16 partitions and replication 2, and is created before consumers start.
- One process writes to topic A; two processes read from topic A.
- Each reader process runs 16 consumers. Some consumers are idle when 16 partitions evenly balance.
- The consumer threads do little work between polls. Message processing happens asynchronously, on a separate thread from the consumer.
- All the consumers for topic A are in the same consumer group.
- The timeout for KafkaConsumer.poll() is 1000 milliseconds. The configuration that affects rebalance is:
max.poll.interval.ms=50000
max.poll.records=100
request.timeout.ms=40000
session.timeout.ms=20000
We use defaults for these:
- heartbeat.interval.ms=3000
- (broker) group.max.session.timeout.ms=300000
- (broker) group.min.session.timeout.ms=6000
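As a sanity check on these numbers, the sketch below tests the relationships the Kafka documentation describes between them: session.timeout.ms must lie within the broker's allowed range, and heartbeat.interval.ms must be lower than session.timeout.ms and is typically no more than a third of it. The poll-timeout check and the throughput arithmetic are my own heuristics, not documented rules, and `ConsumerTimingCheck` is a hypothetical name. The throughput figure assumes load spread evenly over the 16 partitions and every poll returning a full batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks consumer timing settings for obvious mismatches. */
class ConsumerTimingCheck {

    /** Returns one warning per violated relationship; an empty list means none found. */
    static List<String> check(long pollTimeoutMs, long maxPollIntervalMs,
                              long sessionTimeoutMs, long heartbeatIntervalMs,
                              long brokerMinSessionMs, long brokerMaxSessionMs) {
        List<String> warnings = new ArrayList<>();
        if (sessionTimeoutMs < brokerMinSessionMs || sessionTimeoutMs > brokerMaxSessionMs) {
            warnings.add("session.timeout.ms is outside the broker's allowed range");
        }
        if (heartbeatIntervalMs >= sessionTimeoutMs) {
            warnings.add("heartbeat.interval.ms must be lower than session.timeout.ms");
        } else if (heartbeatIntervalMs * 3 > sessionTimeoutMs) {
            warnings.add("heartbeat.interval.ms is above one third of session.timeout.ms");
        }
        // Heuristic (assumption): an idle poll should return well before the poll interval expires.
        if (pollTimeoutMs >= maxPollIntervalMs) {
            warnings.add("poll timeout is not below max.poll.interval.ms");
        }
        return warnings;
    }

    /** Polls per second each active consumer must complete to keep up, assuming full batches. */
    static double minPollsPerSecond(double messagesPerSecond, int maxPollRecords,
                                    int activeConsumers) {
        return messagesPerSecond / ((double) maxPollRecords * activeConsumers);
    }

    public static void main(String[] args) {
        // Values from the configuration listed above.
        System.out.println("warnings: " + check(1000, 50000, 20000, 3000, 6000, 300000));
        System.out.println("polls/s per consumer at 80,000 msg/s: "
                + minPollsPerSecond(80000, 100, 16));
    }
}
```

With the values above, none of these checks raises a warning, and at the peak rate each active consumer needs roughly 50 polls per second, so each poll-and-enqueue cycle has about 20 ms.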