8 votes

We are seeing unexpected rebalances in our Java Kafka consumers, described below. Do these problems sound familiar to anybody? Any tips on APIs or debugging techniques for figuring out what causes a rebalance?

  1. Two processes are reading a topic. Sometimes all partitions on the topic get rebalanced to a single reader process. After restarting both processes, partitions get evenly balanced.

  2. Two processes are reading a topic. Sometimes a long sequence of rebalances bounces partitions from reader to reader. We call pause()/resume() on consumers for backpressure, which should keep poll() running and prevent rebalances caused by slow processing (see the sketch after this list).

  3. Two processes are reading a topic. Sometimes a rebalance happens even though both processes appear to be reading fine. Afterwards, reading continues normally, but the rebalance is a hiccup in processing.
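For reference, our backpressure around poll() looks roughly like the sketch below. It is a minimal sketch: the queue, the watermark thresholds, and the method name are illustrative, not our actual code.

    import java.util.concurrent.BlockingQueue;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Minimal backpressure sketch: pause fetching when the hand-off queue is
    // deep, resume when it drains. Thresholds and the queue are illustrative.
    static ConsumerRecords<String, byte[]> pollWithBackpressure(
            KafkaConsumer<String, byte[]> consumer, BlockingQueue<?> queue) {
        final int HIGH_WATER = 10_000;
        final int LOW_WATER = 1_000;
        if (queue.size() > HIGH_WATER) {
            consumer.pause(consumer.assignment());  // stop fetching, stay in group
        } else if (queue.size() < LOW_WATER) {
            consumer.resume(consumer.paused());     // resume what we paused earlier
        }
        // poll() must keep running even while paused: it drives group membership
        // and resets the max.poll.interval.ms timer, so pausing alone should not
        // trigger a rebalance.
        return consumer.poll(1000);
    }

Our understanding is that pause() only stops record delivery; as long as poll() is called regularly, the consumer stays in the group.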

We expect that partitions would not rebalance without some visible cause or failure.

Sometimes poll() gets stuck (it exceeds the timeout), and we call wakeup() and close(), then create new consumers. Sometimes coordinator heartbeat threads keep running after their consumers are closed (we've seen thousands of them accumulate). The timing seems unrelated to the rebalances, so the rebalances look like a separate problem, but maybe the heartbeat threads are hitting an unlogged network problem.
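The recovery path looks roughly like this sketch. The watchdog wiring (the thread, the deadline, lastPollReturn) is illustrative; the Kafka-specific detail is that wakeup() is the one KafkaConsumer method that is safe to call from another thread, and it makes the stuck poll() throw WakeupException.

    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Illustrative watchdog: if poll() has not returned within stuckMs, call
    // wakeup() so the stuck poll() throws WakeupException; the poll loop then
    // closes the consumer and the caller builds a replacement. lastPollReturn
    // is updated by the poll loop after every poll() call.
    static Thread startWatchdog(final KafkaConsumer<?, ?> consumer,
                                final AtomicLong lastPollReturn,
                                final long stuckMs) {
        Thread watchdog = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                if (System.currentTimeMillis() - lastPollReturn.get() > stuckMs) {
                    consumer.wakeup(); // thread-safe; poll() throws WakeupException
                    return;
                }
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "consumer-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        return watchdog;
    }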

We use a ConsumerRebalanceListener to log and process certain rebalances, but Kafka APIs don't seem to expose data about the cause of rebalances.
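Our listener is essentially the sketch below (logging details simplified): it tells us that a rebalance happened and which partitions moved, but neither callback carries anything about why the coordinator triggered it.

    import java.util.Collection;

    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    // Logs both halves of a rebalance; neither callback carries a reason code.
    public class LoggingRebalanceListener implements ConsumerRebalanceListener {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            System.out.println("Rebalance: revoked " + partitions);
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            System.out.println("Rebalance: assigned " + partitions);
        }
    }

It is wired in via consumer.subscribe(topics, new LoggingRebalanceListener()).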

The rebalances are intermittent and hard to reproduce. They have happened at message rates anywhere from 10,000 to 80,000 per second. We see no obvious errors in the logs.

Our read loop is trivial: basically "while running, poll with a timeout and error handling, then enqueue received messages" (a sketch follows).
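Concretely, the loop is close to this (simplified; the queue and flag names are illustrative):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicBoolean;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.errors.WakeupException;

    // Poll and hand off; all real processing happens on a separate thread, so
    // the time spent between polls stays far below max.poll.interval.ms.
    static void readLoop(KafkaConsumer<String, byte[]> consumer,
                         BlockingQueue<ConsumerRecord<String, byte[]>> queue,
                         AtomicBoolean running) {
        try {
            while (running.get()) {
                ConsumerRecords<String, byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<String, byte[]> record : records) {
                    queue.offer(record); // enqueue; real code handles a full queue
                }
            }
        } catch (WakeupException e) {
            // Thrown inside poll() after the watchdog calls wakeup(); fall through.
        } finally {
            consumer.close();
        }
    }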

People have asked good related questions, but the answers didn't help us.

Configuration:

  1. Kafka 0.10.1.0 (we've started trying 1.0.0 and don't have test results yet)
  2. Java 8 brokers and clients
  3. 2 brokers, 1 ZooKeeper; the processes run stably and no new ones are added
  4. 5 topics, with 2 somewhat busy topics. The rebalances happen on a busy one (topic "A").
  5. Topic A has 16 partitions and replication 2, and is created before consumers start.
  6. One process writes to topic A; two processes read from topic A.
  7. Each reader process runs 16 consumers, so 32 consumers share 16 partitions; some consumers are idle when the 16 partitions balance evenly.
  8. The consumer threads do little work between polls. Message processing happens asynchronously, on a separate thread from the consumer.
  9. All the consumers for topic A are in the same consumer group.
  10. The timeout for KafkaConsumer.poll() is 1000 milliseconds.
  11. The configuration that affects rebalance (expressed as a properties sketch after this list) is:

    1. max.poll.interval.ms=50000
    2. max.poll.records=100
    3. request.timeout.ms=40000
    4. session.timeout.ms=20000

      We use defaults for these:

    5. heartbeat.interval.ms=3000
    6. (broker) group.max.session.timeout.ms=300000
    7. (broker) group.min.session.timeout.ms=6000
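For concreteness, the client-side settings above map to consumer properties like this. The bootstrap servers, group id, and deserializers below are placeholders; the two group.*.session.timeout.ms values are broker-side settings in server.properties, not consumer properties.

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Rebalance-related consumer configuration from the list above.
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder
    props.put("group.id", "topic-a-readers");                    // placeholder
    props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("max.poll.interval.ms", "50000");
    props.put("max.poll.records", "100");
    props.put("request.timeout.ms", "40000");
    props.put("session.timeout.ms", "20000");
    // heartbeat.interval.ms is left at its default of 3000.
    KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);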
We are also suffering from the same problem: Kafka 0.10.0.1, 12 topics, each with 10 partitions, and a different consumer group for every topic. Sometimes some consumer groups rebalance for more than 5 minutes, and after a process is restarted some groups take up to 10 minutes to start consuming. We haven't found a solution in the last 2 months, and no help anywhere. – Shades88

Are the rebalances quick enough? I'm asking because I've experienced issues with the group coordinator due to log cleaner issues. Have you considered upgrading to the latest release of this minor version (0.10.2.3)? – Lior Chaga

1 Answer

0 votes

Check the GC log and make sure there are no frequent full GCs, which would prevent the heartbeat thread from working.
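On Java 8 you can produce a GC log with JVM flags such as -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps. As a rough in-process check (a sketch, not a substitute for reading the GC log), you can also poll the GC MXBeans and watch for large jumps in collection time; a full GC long enough to starve the heartbeat thread past session.timeout.ms would show up there.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Print cumulative GC counts and times; sample this periodically and diff
    // the values to spot long full-GC pauses.
    public class GcCheck {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + " count=" + gc.getCollectionCount()
                        + " timeMs=" + gc.getCollectionTime());
            }
        }
    }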