I run a rabbit HA cluster with 3 nodes and a classic AWS load-balancer(LB) in front of them. There are two apps, one that publishes and the other one that consumes through the LB. When publisher app starts sending 3 million messages, after short period of time its connection is put into Flow Control state. After the publishing is finished, in publisher app logs I can see that all 3 million messages are sent. On the other hand in consumer app log I can only see 500K - 1M messages (varies between runs), which means that the large number of messages is lost.
So what is happening is that in the middle of a run, classic LB decides to change its IP address or drop connections, thus loosing a lot of messages (see my update for more details).
The issue does not occur if I skip LB and hit the nodes directly, doing load-balancing on app side. Of course in this case I lose all the benefits of ELB.
My question are:
- Why is LB changing IP addresses and dropping connections, is that related to high message rate from publisher or Flow Control state?
- How to configure LB, so that this issue doesn't occur?
UPDATE:
This is my understanding what is happening: I use AMQP 0-9-1 and publish without 'publish confirms', so message is considered sent as soon as it's put on a wire. Also, the connection on rabbitmq node is between LB and a node, not Publisher app and a node.
Before the communication enters Flow Control, messages are passed from LB to a node immediately
Then the connection between LB and a node enters Flow Control, Publisher App connection is not blocked and thus it continues to publish at the same rate. That causes messages to pile up on LB.
Then LB decides to change IP(s) or drop the connection for whatever reasons and create a new one, causing all the piled messages to be lost. This is clearly visible from the RabbitMQ logs:
=WARNING REPORT==== 6-Jan-2018::10:35:50 === closing AMQP connection <0.30342.375> (10.1.1.250:29564 -> 10.1.1.223:5672): client unexpectedly closed TCP connection
=INFO REPORT==== 6-Jan-2018::10:35:51 === accepting AMQP connection <0.29123.375> (10.1.1.22:1886 -> 10.1.1.223:5672)