I'm running a single-node Kafka 2.0.0 setup and experiencing strange hangs at specific points during its life cycle.
I have a consumer connected 24/7, and at some point it drops off, throwing these errors:
2018/11/27 22:34:33 Consumer error: kafkaa:9094/1001: 1 request(s) timed out: disconnect (<nil>)
%3|1543358073.459|ERROR|rdkafka#consumer-1| [thrd:app]: rdkafka#consumer-1: 1/1 brokers are down: Local: All broker connections are down
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x7fb3c5fdedb4]

runtime stack:
runtime.throw(0x9749e0, 0x2a)
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:374 +0x2f2

goroutine 1 [syscall]:
Nothing changed on the application side, yet the consumer gets disconnected unexpectedly.
I went back through the Kafka logs to the point where the disconnections start:
[2018-11-27 22:30:37,905] INFO [GroupMetadataManager brokerId=1001] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-11-27 22:33:36,037] TRACE [Controller id=1001] Leader imbalance ratio for broker 1001 is 0.0 (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] TRACE [Controller id=1001] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] DEBUG [Controller id=1001] Topics not in preferred replica for broker 1001 Map() (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] DEBUG [Controller id=1001] Preferred replicas by broker Map(1001 -> Map(__consumer_offsets-22 -> Vector(1001), __consumer_offsets-30 -> Vector(1001), __consumer_offsets-8 -> Vector(1001), __consumer_offsets-21 -> Vector(1001), __consumer_offsets-4 -> Vector(1001), __consumer_offsets-27 -> Vector(1001), __consumer_offsets-7 -> Vector(1001), pcap-input-0 -> Vector(1001), __consumer_offsets-9 -> Vector(1001), __consumer_offsets-46 -> Vector(1001), __consumer_offsets-25 -> Vector(1001), __consumer_offsets-35 -> Vector(1001), __consumer_offsets-41 -> Vector(1001), __consumer_offsets-33 -> Vector(1001), __consumer_offsets-23 -> Vector(1001), __consumer_offsets-49 -> Vector(1001), _schemas-0 -> Vector(1001), pcap-output-0 -> Vector(1001), __consumer_offsets-47 -> Vector(1001), __consumer_offsets-16 -> Vector(1001), __consumer_offsets-28 -> Vector(1001), __consumer_offsets-31 -> Vector(1001), __consumer_offsets-36 -> Vector(1001), __consumer_offsets-42 -> Vector(1001), __consumer_offsets-3 -> Vector(1001), __consumer_offsets-18 -> Vector(1001), __consumer_offsets-37 -> Vector(1001), __consumer_offsets-15 -> Vector(1001), __consumer_offsets-24 -> Vector(1001), pcap-input-error-0 -> Vector(1001), __consumer_offsets-38 -> Vector(1001), __consumer_offsets-17 -> Vector(1001), __consumer_offsets-48 -> Vector(1001), __confluent.support.metrics-0 -> Vector(1001), __consumer_offsets-19 -> Vector(1001), __consumer_offsets-11 -> Vector(1001), __consumer_offsets-13 -> Vector(1001), __consumer_offsets-2 -> Vector(1001), __consumer_offsets-43 -> Vector(1001), __consumer_offsets-6 -> Vector(1001), __consumer_offsets-14 -> Vector(1001), __consumer_offsets-20 -> Vector(1001), __consumer_offsets-0 -> Vector(1001), __consumer_offsets-44 -> Vector(1001), pcaps-output-failures-memsql-0 -> Vector(1001), __consumer_offsets-39 -> Vector(1001), __consumer_offsets-12 -> Vector(1001), __consumer_offsets-45 -> Vector(1001), __consumer_offsets-1 -> Vector(1001), 
__consumer_offsets-5 -> Vector(1001), __consumer_offsets-26 -> Vector(1001), __consumer_offsets-29 -> Vector(1001), __consumer_offsets-34 -> Vector(1001), __consumer_offsets-10 -> Vector(1001), pcaps-output-failures-elastic-0 -> Vector(1001), __consumer_offsets-32 -> Vector(1001), __consumer_offsets-40 -> Vector(1001))) (kafka.controller.KafkaController)
[2018-11-27 22:48:10,422] WARN Attempting to send response via channel for which there is no open connection, connection id 172.17.0.28:9094-172.17.0.27:40266-160 (kafka.network.Processor)
[2018-11-27 22:48:10,422] WARN Client session timed out, have not heard from server in 849630ms for sessionid 0x10001e9d5c00006 (org.apache.zookeeper.ClientCnxn)
Resource metrics show the broker spiking to 100% CPU at this point (22:34, to be exact).
What is the broker doing here, and how does that explain 100% CPU consumption?