I'm running a single-node Kafka 2.0.0 setup and experiencing strange hangs at specific points during its life cycle.
I have a consumer connected 24/7, and at some point it drops off, throwing these errors:
2018/11/27 22:34:33 Consumer error: kafkaa:9094/1001: 1 request(s) timed out: disconnect (<nil>)
%3|1543358073.459|ERROR|rdkafka#consumer-1| [thrd:app]: rdkafka#consumer-1: 1/1 brokers are down: Local: All broker connections are down
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x7fb3c5fdedb4]

runtime stack:
runtime.throw(0x9749e0, 0x2a)
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:374 +0x2f2

goroutine 1 [syscall]:
Nothing changed on the application side, yet the consumer gets disconnected unexpectedly.
I went back through the Kafka logs to the point where the disconnections start:
[2018-11-27 22:30:37,905] INFO [GroupMetadataManager brokerId=1001] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-11-27 22:33:36,037] TRACE [Controller id=1001] Leader imbalance ratio for broker 1001 is 0.0 (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] TRACE [Controller id=1001] Checking need to trigger auto leader balancing (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] DEBUG [Controller id=1001] Topics not in preferred replica for broker 1001 Map() (kafka.controller.KafkaController)
[2018-11-27 22:33:36,037] DEBUG [Controller id=1001] Preferred replicas by broker Map(1001 -> Map(__consumer_offsets-22 -> Vector(1001), __consumer_offsets-30 -> Vector(1001), __consumer_offsets-8 -> Vector(1001), __consumer_offsets-21 -> Vector(1001), __consumer_offsets-4 -> Vector(1001), __consumer_offsets-27 -> Vector(1001), __consumer_offsets-7 -> Vector(1001), pcap-input-0 -> Vector(1001), __consumer_offsets-9 -> Vector(1001), __consumer_offsets-46 -> Vector(1001), __consumer_offsets-25 -> Vector(1001), __consumer_offsets-35 -> Vector(1001), __consumer_offsets-41 -> Vector(1001), __consumer_offsets-33 -> Vector(1001), __consumer_offsets-23 -> Vector(1001), __consumer_offsets-49 -> Vector(1001), _schemas-0 -> Vector(1001), pcap-output-0 -> Vector(1001), __consumer_offsets-47 -> Vector(1001), __consumer_offsets-16 -> Vector(1001), __consumer_offsets-28 -> Vector(1001), __consumer_offsets-31 -> Vector(1001), __consumer_offsets-36 -> Vector(1001), __consumer_offsets-42 -> Vector(1001), __consumer_offsets-3 -> Vector(1001), __consumer_offsets-18 -> Vector(1001), __consumer_offsets-37 -> Vector(1001), __consumer_offsets-15 -> Vector(1001), __consumer_offsets-24 -> Vector(1001), pcap-input-error-0 -> Vector(1001), __consumer_offsets-38 -> Vector(1001), __consumer_offsets-17 -> Vector(1001), __consumer_offsets-48 -> Vector(1001), __confluent.support.metrics-0 -> Vector(1001), __consumer_offsets-19 -> Vector(1001), __consumer_offsets-11 -> Vector(1001), __consumer_offsets-13 -> Vector(1001), __consumer_offsets-2 -> Vector(1001), __consumer_offsets-43 -> Vector(1001), __consumer_offsets-6 -> Vector(1001), __consumer_offsets-14 -> Vector(1001), __consumer_offsets-20 -> Vector(1001), __consumer_offsets-0 -> Vector(1001), __consumer_offsets-44 -> Vector(1001), pcaps-output-failures-memsql-0 -> Vector(1001), __consumer_offsets-39 -> Vector(1001), __consumer_offsets-12 -> Vector(1001), __consumer_offsets-45 -> Vector(1001), __consumer_offsets-1 -> Vector(1001), 
__consumer_offsets-5 -> Vector(1001), __consumer_offsets-26 -> Vector(1001), __consumer_offsets-29 -> Vector(1001), __consumer_offsets-34 -> Vector(1001), __consumer_offsets-10 -> Vector(1001), pcaps-output-failures-elastic-0 -> Vector(1001), __consumer_offsets-32 -> Vector(1001), __consumer_offsets-40 -> Vector(1001))) (kafka.controller.KafkaController)
[2018-11-27 22:48:10,422] WARN Attempting to send response via channel for which there is no open connection, connection id 172.17.0.28:9094-172.17.0.27:40266-160 (kafka.network.Processor)
[2018-11-27 22:48:10,422] WARN Client session timed out, have not heard from server in 849630ms for sessionid 0x10001e9d5c00006 (org.apache.zookeeper.ClientCnxn)
Resource metrics show the broker spiking to 100% CPU at this point (22:34, to be exact).
What is the broker doing here, and how does that explain 100% CPU consumption?