There are a few existing questions that discuss how to use ZeroMQ to work around the possibility of dropped messages and most have been very instructive.
Still, one thing keeps troubling me about the ZMQ_DEALER socket. I have been testing with a very simple case: 1 server and 2 clients, each using a single ZMQ_DEALER socket. The server sends messages and the clients receive them.
If the server uses socket.bind() and the clients use socket.connect(), we observe proper round-robin balancing, and killing one of the clients results in the server redirecting all its messages to the remaining client. No delay, no message loss, works beautifully.
Now if I have the clients do socket.bind() and the server do socket.connect() (still using a single socket, but connected to both clients), the server's behavior changes. After one of the clients is killed, instead of redirecting its traffic to the remaining one, the server keeps load balancing across both until the number of messages queued for the dead client hits the high-water mark.
The possibility of calling connect() on a socket that is already bound led me to think the two usages would be more or less symmetrical. I would like to know both why this behavior occurs and whether there is a way to get the failover of bound sockets on connected ones.
EDIT: in order to make the question a little more inspiring, here is some code for you to test this behavior.
This is the dealer:
// dealer.cc
// compile using something like this: g++ dealer.cc -o dealer -lzmq
#include <zmq.hpp>
#include <unistd.h>   // usleep
#include <cstring>    // memcpy

int main() {
    // prepare zmq
    zmq::context_t context(1);
    zmq::socket_t socket(context, ZMQ_DEALER);
    socket.bind("tcp://127.0.0.1:5555");
    // For the inverted setup, comment out bind() above and connect to both clients instead:
    //socket.connect("tcp://127.0.0.1:5555");
    //socket.connect("tcp://127.0.0.1:5556");

    int counter = 0;
    while (true) {
        // send incrementing numbers, one every 100 ms
        zmq::message_t world(sizeof(int));
        memcpy(world.data(), &counter, sizeof(int));
        socket.send(world);
        counter++;
        usleep(100000);
    }
    return 0;
}
And this is the client:
// client.cc
// compile using something like this: g++ client.cc -o client -lzmq
#include <zmq.hpp>
#include <iostream>

int main(int argc, char ** argv) {
    zmq::context_t context(1);
    zmq::socket_t socket(context, ZMQ_DEALER);
    socket.connect("tcp://127.0.0.1:5555");
    // For the inverted setup, comment out connect() above and bind to the URL given as argv[1]:
    //socket.bind(argv[1]);

    zmq::message_t msg;
    while (true) {
        // block until the next message arrives, then print the counter it carries
        socket.recv(&msg, 0);
        std::cout << "Received a message: " << *static_cast<int *>(msg.data()) << std::endl;
    }
    return 0;
}
Create 2 clients, then start the dealer, then kill one of the clients before killing the dealer. If you check the output, you can see not a single message was dropped or stuck in zmq queue limbo. The load was balanced as long as both clients were alive, and completely redirected to the remaining one when the "failure" occurred.
Now let's swap the connect()/bind() calls for their inverses (using the commented-out code). The clients need to know which address to bind to, so they should be started with the URL as follows:
./client tcp://127.0.0.1:5555
./client tcp://127.0.0.1:5556
Then, just as previously, start the dealer and kill one of the clients. You can see that the remaining client only receives half of the dealer's messages, even after the first client was killed. My understanding is that as long as the underlying ZMQ queue is not full, the dealer will continue queuing messages for the disconnected peer (which, in this example and with the default parameters, is going to take a really long time).