4
votes

I run a RabbitMQ HA cluster with 3 nodes and a classic AWS load balancer (LB) in front of them. There are two apps: one publishes and the other consumes, both through the LB. When the publisher app starts sending 3 million messages, its connection is put into the Flow Control state after a short period of time. After publishing finishes, the publisher app logs show that all 3 million messages were sent. The consumer app log, on the other hand, shows only 500K - 1M messages (it varies between runs), which means that a large number of messages are lost.

So what is happening is that, in the middle of a run, the classic LB decides to change its IP address or drop connections, thus losing a lot of messages (see my update for more details).

The issue does not occur if I skip LB and hit the nodes directly, doing load-balancing on app side. Of course in this case I lose all the benefits of ELB.

My questions are:

  • Why is the LB changing IP addresses and dropping connections? Is that related to the high message rate from the publisher or to the Flow Control state?
  • How can I configure the LB so that this issue doesn't occur?

UPDATE:

This is my understanding of what is happening: I use AMQP 0-9-1 and publish without 'publisher confirms', so a message is considered sent as soon as it's put on the wire. Also, the connection on the RabbitMQ node is between the LB and the node, not between the publisher app and the node.

  1. Before the connection enters Flow Control, messages are passed from the LB to a node immediately.

  2. Then the connection between the LB and a node enters Flow Control. The publisher app's connection is not blocked, so it continues to publish at the same rate. That causes messages to pile up on the LB.

  3. Then the LB decides to change its IP(s) or drop the connection for whatever reason and create a new one, causing all the piled-up messages to be lost. This is clearly visible in the RabbitMQ logs:

    =WARNING REPORT==== 6-Jan-2018::10:35:50 === closing AMQP connection <0.30342.375> (10.1.1.250:29564 -> 10.1.1.223:5672): client unexpectedly closed TCP connection

    =INFO REPORT==== 6-Jan-2018::10:35:51 === accepting AMQP connection <0.29123.375> (10.1.1.22:1886 -> 10.1.1.223:5672)



2 Answers

4
votes

The solution is to use an AWS Network Load Balancer. The network LB creates a connection between the publisher app and the RabbitMQ node, so if the connection is blocked or dropped, the publisher is aware of it and can act accordingly. I ran the same test with 3M messages and not a single message was lost.


In the AWS docs, there's this line which explains the behaviour:

Preserve source IP address: Network Load Balancer preserves the client side source IP allowing the back-end to see the IP address of the client. This can then be used by applications for further processing.

From: https://aws.amazon.com/elasticloadbalancing/details/

3
votes

ELBs will change their addresses when they scale in reaction to traffic. New nodes come up, and appear in DNS, and then old nodes may go away eventually, or they may stay online.

It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. (emphasis added)

— from “Best Practices in Evaluating Elastic Load Balancing”
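On the client side, the 60-second TTL means an address obtained at startup should not be reused forever. A minimal sketch of one way to honour it when reconnecting, assuming your client lets you choose the address to connect to; `TtlResolver`, `lookup`, and the host name in the usage note are hypothetical names for illustration, not part of any AWS or RabbitMQ library:

```python
import socket
import time


def lookup(host, port):
    """Resolve host to a sorted list of unique IP addresses (real DNS call)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class TtlResolver:
    """Caches a DNS answer for at most ttl seconds, then looks it up again."""

    def __init__(self, host, port, ttl=60.0, resolve=lookup, clock=time.monotonic):
        self.host, self.port, self.ttl = host, port, ttl
        self._resolve, self._clock = resolve, clock
        self._addresses = []
        self._expires_at = None  # None means nothing has been cached yet

    def addresses(self):
        """Return the cached addresses, refreshing them once the TTL has passed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._addresses = self._resolve(self.host, self.port)
            self._expires_at = now + self.ttl
        return list(self._addresses)
```

Calling something like `TtlResolver("my-elb.example.com", 5672).addresses()` before every (re)connect keeps the client from holding on to a balancer node that has already been retired. It does nothing for messages that were in flight when a connection dropped.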

You may find more useful information in that "best practices" guide, including the concept of pre-warming a balancer with the help of AWS support, and how to ramp up your test traffic in a way that the balancer's scaling can keep up.
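To make the ramp-up idea concrete, here is a small sketch that computes a stepped publishing schedule. The defaults (grow by 50% every five minutes) are my recollection of the guide's suggestion and should be treated as assumptions to verify against the guide; `ramp_schedule` is a hypothetical helper, not an AWS API:

```python
def ramp_schedule(start_rate, target_rate, growth=0.5, step_seconds=300):
    """Return [(offset_seconds, messages_per_second), ...] ending at target_rate.

    Each step multiplies the previous rate by (1 + growth); the final step is
    capped at target_rate.
    """
    if start_rate <= 0 or growth <= 0:
        raise ValueError("start_rate and growth must be positive")
    schedule = []
    rate, offset = float(start_rate), 0
    while rate < target_rate:
        schedule.append((offset, rate))
        rate *= 1.0 + growth
        offset += step_seconds
    schedule.append((offset, float(target_rate)))
    return schedule
```

For example, `ramp_schedule(1000, 5000)` yields five steps, reaching 5000 messages per second after 20 minutes. The publisher would then throttle itself to the rate of the current step instead of sending at full speed from the first second.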

The behavior of a classic ELB is automatic, and not configurable by the user.

But it also sounds as if you have configuration issues with your queue, because it seems like it should be more resilient to dropped connections.
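One way to get that resilience is AMQP 0-9-1 publisher confirms: the publisher keeps every message until the broker acknowledges it and republishes whatever was still outstanding when a connection died. Below is a minimal sketch of the bookkeeping only, with no client library calls; `ConfirmTracker` and its method names are hypothetical, and you would wire them to your client's ack, nack, and connection-closed callbacks:

```python
class ConfirmTracker:
    """Tracks messages published on a confirm-mode channel until the broker confirms them."""

    def __init__(self):
        self._next_tag = 1   # confirm-mode delivery tags start at 1 on each channel
        self._pending = {}   # delivery tag -> message body awaiting confirmation

    def published(self, body):
        """Record a message that was just published; returns its delivery tag."""
        tag = self._next_tag
        self._pending[tag] = body
        self._next_tag += 1
        return tag

    def _take(self, tag, multiple):
        # With multiple=True the broker confirms every tag up to and including `tag`.
        tags = [t for t in self._pending if t <= tag] if multiple else [tag]
        return [self._pending.pop(t) for t in sorted(tags) if t in self._pending]

    def on_ack(self, tag, multiple=False):
        """Broker took responsibility for the message(s); forget them."""
        self._take(tag, multiple)

    def on_nack(self, tag, multiple=False):
        """Broker rejected the message(s); return the bodies to republish."""
        return self._take(tag, multiple)

    def on_connection_lost(self):
        """Return every unconfirmed body, in publish order, for republishing."""
        lost = [self._pending[t] for t in sorted(self._pending)]
        self._pending.clear()
        self._next_tag = 1   # a new channel restarts tag numbering
        return lost
```

With this in place, a dropped connection costs a republish rather than lost messages. Consumers may then see duplicates, so they should be idempotent.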

Note also that an AWS Network Load Balancer does not change its IP addresses and does not need to scale by replacing resources the way ELB does, because unlike ELB, it doesn't appear to run on hidden instances -- it's part of the network infrastructure, or at least appears that way. This might be a viable alternative.