We have been seeing the issue below with RabbitMQ and have been manually restarting the servers every weekend as a workaround.
Network partition detected
Mnesia reports that this RabbitMQ cluster has experienced a network partition. This is a dangerous situation. RabbitMQ clusters should not be installed on networks which can experience partitions.
We have gone through other popular posts on the topic, e.g. here and here.
Our network is not highly reliable, and occasional blips are expected. But when the network does come back up, I would have expected the 1 node of the 4-node RabbitMQ cluster to rejoin the rest of the cluster, as is the case with the 4 Tomcat nodes installed on the same servers.
- Although the nodes in a single partition continue to run independently, that doesn't seem like a graceful recovery from a failure in one node.
- We didn't have much luck using any rabbitmqctl commands, such as rabbitmqctl cluster_status.
- Running them would sporadically cause the RabbitMQ process to hang, after which the process had to be killed with sudo kill.
We are at the point of evaluating a move to Kafka or another message broker that handles network partitions well.
Any thoughts on how to avoid the manual RabbitMQ restarts, or on Kafka's ability to handle this situation, would be highly appreciated.