14
votes

I'm trying to set up a cluster of RabbitMQ servers to get highly available queues using an active/passive server architecture. I'm following these guides:

  1. http://www.rabbitmq.com/clustering.html
  2. http://www.rabbitmq.com/ha.html
  3. http://karlgrz.com/rabbitmq-highly-available-queues-and-clustering-using-amazon-ec2/

My requirement for high availability is simple: I have two nodes (CentOS 6.4) with RabbitMQ (v3.2) and Erlang R15B03. Node1 must be the "active" node, responding to all requests, and Node2 must be the "passive" node that has all the queues and messages replicated from Node1.

To do that, I have configured the following:

  • Node1 with RabbitMQ working fine in non-cluster mode
  • Node2 with RabbitMQ working fine in non-cluster mode

Next, I created a cluster between the two nodes by joining Node2 to Node1 (guide 1). After that, I configured a policy to mirror the queues (guide 2), replicating all queues and messages across all the nodes in the cluster. This works: I can connect to either node and publish or consume messages while both nodes are available.

The problem occurs with a queue "queueA" that was created on Node1 (so Node1 is the master for queueA). When Node1 is stopped, I can't connect to queueA on Node2 to produce or consume messages. Node2 throws an error saying that Node1 is not accessible. (I think queueA is not replicated to Node2, so Node2 can't be promoted to master of queueA.)

The error is:

{"The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text=\"NOT_FOUND - home node 'rabbit@node1' of durable queue 'queueA' in vhost 'app01' is down or inaccessible\", classId=50, methodId=10, cause="}

The sequence of steps used is:

Node1:

1. rabbitmq-server -detached
2. rabbitmqctl start_app

Node2:

3. Copy .erlang.cookie from Node1 to Node2
4. rabbitmq-server -detached

Join the cluster (Node2):

5. rabbitmqctl stop_app
6. rabbitmqctl join_cluster rabbit@node1
7. rabbitmqctl start_app

Configure Queue mirroring policy:

8. rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

Note: The pattern used for queue names is "" (all queues).

When I run 'rabbitmqctl list_policies' and 'rabbitmqctl cluster_status', everything looks OK.

Why can't Node2 respond when Node1 is unavailable? Is there something wrong with this setup?

5
What properties do the queue and the messages you send have? Does your cluster persist the messages? (This is part of the cluster configuration when you set up a node.) Also take a look here: stackoverflow.com/a/23224388/1248724 – Zarathustra

5 Answers

6
votes

You haven't specified the virtual host (app01) in your set_policy call, so the policy only applies to the default virtual host (/). This command line should work:

rabbitmqctl set_policy -p app01 ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
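
To confirm that the policy now applies on the right virtual host, something like the following may help. It is only a sketch: it assumes the vhost name app01 from the error message, and that it is run on one of the cluster nodes with rabbitmqctl on the PATH.

```shell
#!/bin/sh
# Assumption: the vhost is app01, as shown in the error message.
VHOST="app01"

if command -v rabbitmqctl >/dev/null 2>&1; then
    # Policies defined on this vhost; ha-all should be listed here.
    rabbitmqctl list_policies -p "$VHOST"
    # Each queue together with the name of the policy applied to it
    # (the policy column is empty when no policy matches the queue).
    rabbitmqctl list_queues -p "$VHOST" name policy
else
    echo "rabbitmqctl not found on PATH; run this on a cluster node" >&2
fi
```

If queueA shows an empty policy column, the policy is still not matching it in that vhost.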

1
votes

In the web management console, is queueA listed as Node1 +1?

It sounds like there might be some issue with your setup. I've got a set of Vagrant boxes that are pre-configured to work in a cluster; it might be worth trying them to help identify issues in your setup.
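
The management console check mentioned above also has a rough command-line equivalent. This is a sketch, assuming the app01 vhost from the question and a RabbitMQ 3.x node where the mirror columns are still called slave_pids and synchronised_slave_pids:

```shell
#!/bin/sh
# Assumption: the vhost is app01, as shown in the error message.
VHOST="app01"

if command -v rabbitmqctl >/dev/null 2>&1; then
    # pid                    -> process (and node) of the queue master
    # slave_pids             -> mirrors of the queue on other nodes
    # synchronised_slave_pids -> mirrors that are fully in sync with the master
    rabbitmqctl list_queues -p "$VHOST" name pid slave_pids synchronised_slave_pids
else
    echo "rabbitmqctl not found on PATH; run this on a cluster node" >&2
fi
```

An empty slave_pids column for queueA would mean the queue has no mirror on Node2, which matches the "home node is down" error.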

1
votes

Only mirrors that are synchronized with the master are promoted to master when the master is shut down. This is the default behavior, but it can be changed by setting the ha-promote-on-shutdown policy key to always.
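
As an illustration only, the policy from the question could be extended like this. The vhost (app01) and policy name (ha-all) are taken from the question; whether the ha-promote-on-shutdown key is supported depends on the RabbitMQ version (I believe it was added after 3.2), so check the ha.html guide for your release first. Keep in mind that promoting an unsynchronised mirror can lose messages.

```shell
#!/bin/sh
# Assumptions: vhost app01 and policy name ha-all, as in the question.
VHOST="app01"
# Same mirroring policy as before, plus promotion of unsynchronised mirrors
# on a controlled shutdown of the master.
POLICY='{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'

if command -v rabbitmqctl >/dev/null 2>&1; then
    # The empty pattern "" matches every queue name in the vhost.
    rabbitmqctl set_policy -p "$VHOST" ha-all "" "$POLICY"
else
    echo "rabbitmqctl not found on PATH; run this on a cluster node" >&2
fi
```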

-1
votes

Read your reference carefully:

http://www.rabbitmq.com/ha.html

You could use a cluster of RabbitMQ nodes to construct your RabbitMQ broker. This will be resilient to the loss of individual nodes in terms of the overall availability of service, but some important caveats apply: whilst exchanges and bindings survive the loss of individual nodes, queues and their messages do not. This is because a queue and its contents reside on exactly one node, thus the loss of a node will render its queues unavailable.

-1
votes

Make sure that your queue is not durable or exclusive.

From the documentation (https://www.rabbitmq.com/ha.html):

Exclusive queues will be deleted when the connection that declared them is closed. For this reason, it is not useful for an exclusive queue to be mirrored (or durable for that matter) since when the node hosting it goes down, the connection will close and the queue will need to be deleted anyway.

For this reason, exclusive queues are never mirrored (even if they match a policy stating that they should be). They are also never durable (even if declared as such).

From your error message:

{"The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=404, text=\"NOT_FOUND - home node 'rabbit@node1' of durable queue 'queueA' in vhost 'app01' is down or inaccessible\", classId=50, methodId=10, cause="}

It looks like you created a durable queue.