1 vote

We have set up a ZooKeeper quorum (3 nodes) and 3 Kafka brokers. The producers were unable to send records to Kafka, resulting in data loss. During the investigation, we could still SSH into the affected broker and observed that its disk was full. We deleted topic logs to free some disk space, and the broker functioned as expected again.

Given that we could still SSH into that broker (we can't see the logs right now), I assume ZooKeeper was still receiving heartbeats from it and didn't consider it down? What is the best practice for handling such events?

1
serverfault.com would be a better forum for this question. Stack Overflow is meant for programming questions, while Server Fault covers admin and network-related questions, and this is more of an admin-type question. – tk421

1 Answer

3 votes

The best practice is to prevent this from happening in the first place!

You need to monitor the disk usage of your brokers and set up alerts that fire well before the available disk space runs out.
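For example, here is a minimal sketch of a disk-space check you could wire into a cron job or monitoring agent; the log directory path and the 15% threshold are assumptions, so point it at your broker's log.dirs and pick a threshold that gives you time to react:

```java
import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        // Assumed default path; pass your broker's log.dirs as an argument.
        File logDir = new File(args.length > 0 ? args[0] : "/var/kafka-logs");
        double freeRatio = (double) logDir.getUsableSpace() / logDir.getTotalSpace();
        if (freeRatio < 0.15) {
            // In practice, send this to your alerting system instead of stderr.
            System.err.printf("ALERT: only %.1f%% disk free on %s%n",
                    freeRatio * 100, logDir.getAbsolutePath());
            System.exit(1);
        }
        System.out.printf("OK: %.1f%% disk free on %s%n",
                freeRatio * 100, logDir.getAbsolutePath());
    }
}
```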

You need to set retention limits on your topics (retention.ms, retention.bytes) to ensure data is deleted regularly.
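As an illustration, here is a sketch of setting retention limits with Kafka's AdminClient; the topic name, bootstrap address, and the 7-day / 10 GiB values are placeholders, and note that retention.bytes applies per partition:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed bootstrap address; replace with your brokers.
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Cap the topic at 7 days and ~10 GiB per partition; tune to your workload.
            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),
                        AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("retention.bytes", "10737418240"),
                        AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```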

You can also use topic policies to control how much retention time/size is allowed when creating or updating topics, so that topics can't fill your disks (see create.topic.policy.class.name for creation, and alter.config.policy.class.name for config updates).
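A rough sketch of such a policy, using Kafka's pluggable CreateTopicPolicy interface; the class name and the 7-day cap are made up for illustration. The compiled class must be on the brokers' classpath, with create.topic.policy.class.name pointing at it in server.properties:

```java
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

// Rejects topic creation unless retention.ms is set explicitly,
// non-negative, and at most 7 days.
public class RetentionLimitPolicy implements CreateTopicPolicy {
    private static final long MAX_RETENTION_MS = 7L * 24 * 60 * 60 * 1000;

    @Override
    public void validate(RequestMetadata requestMetadata) throws PolicyViolationException {
        String retention = requestMetadata.configs().get("retention.ms");
        if (retention == null) {
            throw new PolicyViolationException("retention.ms must be set explicitly");
        }
        long ms;
        try {
            ms = Long.parseLong(retention);
        } catch (NumberFormatException e) {
            throw new PolicyViolationException("retention.ms is not a valid number");
        }
        // A negative value (-1) means infinite retention, so reject it too.
        if (ms < 0 || ms > MAX_RETENTION_MS) {
            throw new PolicyViolationException(
                    "retention.ms must be between 0 and " + MAX_RETENTION_MS);
        }
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}
```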

The recovery steps you took are OK, but to keep your cluster availability high, you really don't want to let the disks fill up in the first place.