
I have a Kafka cluster of 15 nodes: kafka1.com:9092, kafka2.com:9092, ..., kafkaN.com:9092.

I also have an application running on 10 nodes which consumes around 100 topics (from 2 to 240 partitions each). For each topic the application creates a separate KafkaStreams instance with some transform logic.
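
For context, here is a minimal sketch of that per-topic setup (the topic names, the bootstrap list beyond kafka1.com:9092, and the `mapValues` transform are placeholders, not my actual logic):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class PerTopicStreamsLauncher {

    public static void main(String[] args) {
        // Placeholder topic list; the real application consumes ~100 topics.
        List<String> topics = Arrays.asList("topic-a", "topic-b", "topic-c");

        for (String topic : topics) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app-" + topic);
            // Abbreviated bootstrap list; kafka1.com:9092 is the first entry.
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1.com:9092,kafka2.com:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Stand-in for the real transform logic.
            builder.<String, String>stream(topic)
                   .mapValues(value -> value.toUpperCase())
                   .to(topic + "-out");

            // One KafkaStreams instance (with its own consumer/producer/admin clients) per topic.
            new KafkaStreams(builder.build(), props).start();
        }
    }
}
```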

Before I had this many consumer nodes I did not have any issues, but after increasing the node count the application started to load the Kafka cluster heavily during deploys. While the cluster as a whole loses on average ~2% of CPU during a deploy, the first node in the broker list, kafka1.com:9092, loses around 50-60% of CPU, and the key metric "Average fraction of time the request handler threads are idle" becomes very low for this broker.

This behavior is exactly the same whether Kafka is under load (huge throughput) or no messages are being processed at all.

I've tried to play with the settings, but each time I see the same metrics =( I even updated kafka-clients to version 2.3.0. The brokers are on version 1.1.1.

I think it might be connected to metadata fetching, since I see no other reason why Kafka Streams (its consumer, producer, and admin clients) would request extra data from the first broker in the list.

But why does it load that broker so much?


1 Answer


Without specific details and monitoring metrics it's hard to tell what exactly the root cause is.

However, from experience, these are the main causes of uneven Kafka load distribution:

  1. Uneven partition distribution between Kafka broker nodes

    This situation is actually described in the official Kafka documentation. Even though it is described there in the context of expanding a cluster, it can also happen during regular operation of a big cluster. Basically, it is quite likely that Kafka distributes partitions between broker nodes unevenly.

    So it might be that kafka1.com:9092 is the leader for a large fraction of the partitions in the cluster and therefore has increased CPU/disk/network usage (because the largest share of consumers connects to it, plus the overhead spent on replication).

    The solution to this problem is to explicitly reassign partitions; a quick way to check the current leader distribution is sketched after this list.

  2. Uneven leader election (and repeated leader re-balancing)

    This usually happens together with uneven partition distribution. Basically, if leader nodes are overwhelmed, Kafka will decide to re-elect the leaders. However, since the partitions are distributed unevenly, this doesn't help; it just causes more leader re-elections and thus increases the load on the cluster.

    The solution to this problem is increasing the replication factor (together with partition reassignment).

    It might look counter-intuitive (since increasing the replication factor increases replication overhead). However, it hints Kafka to distribute data across more nodes, which helps off-load the overwhelmed ones.

  3. Uneven message distribution between partitions (and corresponding broker nodes)

    Basically, if the application uses the DefaultPartitioner (which is usually the case), you might not get a round-robin distribution of messages between partitions. According to the Kafka FAQ: if there are fewer producers than partitions, at a given point in time some partitions may not receive any data. As a consequence, some partitions (and the corresponding broker nodes) may be overloaded with data while others are underloaded. A sketch of an explicit round-robin partitioner is included at the end of this answer.
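
A quick way to check points 1 and 2 is to count how many partition leaders each broker currently holds. Here is a minimal sketch using the Java `AdminClient` (the class name and bootstrap server are placeholders); if one broker leads a disproportionate share of partitions, reassignment is worth considering:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderDistributionCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1.com:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch all topic names, then their descriptions (partitions, leaders, replicas).
            Set<String> topicNames = admin.listTopics().names().get();
            Map<String, TopicDescription> topics = admin.describeTopics(topicNames).all().get();

            // Count how many partitions each broker currently leads
            // (assumes every partition has a live leader).
            Map<Integer, Integer> leadersPerBroker = new HashMap<>();
            topics.values().forEach(td ->
                td.partitions().forEach(p ->
                    leadersPerBroker.merge(p.leader().id(), 1, Integer::sum)));

            leadersPerBroker.forEach((brokerId, count) ->
                System.out.printf("broker %d leads %d partitions%n", brokerId, count));
        }
    }
}
```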

Also, if the cluster is busy overall even when there are no actual input messages, this is usually caused by Kafka keeping all replicas in sync and exchanging offsets and ACKs.
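
Finally, regarding point 3: if uneven message distribution turns out to be the issue, one option is to plug an explicit round-robin partitioner into the producer via `partitioner.class` (newer kafka-clients versions ship a `RoundRobinPartitioner`; for older clients a hand-rolled sketch, with a placeholder class name, might look like this):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

/**
 * Sketch of a partitioner that cycles through a topic's partitions,
 * ignoring the record key, to spread messages evenly.
 * Note: this gives up key-based ordering and co-partitioning.
 */
public class SimpleRoundRobinPartitioner implements Partitioner {

    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        int next = counters.computeIfAbsent(topic, t -> new AtomicInteger(0)).getAndIncrement();
        // Mask the sign bit so the index stays non-negative after counter overflow.
        return (next & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

It would be registered with `props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, SimpleRoundRobinPartitioner.class)`; keep in mind that ignoring keys trades ordering guarantees for a more even load.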