
I'm trying to dynamically load balance a workload. To do this I am using the Req-Router pattern.

However currently it just hangs rather than any of the messages going through.

This is my workload generator code (server.py):

import zmq

address = "ipc:///path/to/named/socket.sock"

context = zmq.Context()
socket = context.socket(zmq.ROUTER)
socket.bind(address)

for i in range(1000):
    addr, _, resp = socket.recv_multipart()
    print(resp)
    # message frames must be bytes, so the reply payload gets encoded
    socket.send_multipart([addr, b'', ("Ping: " + str(i)).encode()])

And my client code (Client.java):

import org.zeromq.SocketType;
import org.zeromq.ZMQ;

public class Client {
    public static void main(String[] args) throws Exception {
        String address = "/path/to/named/socket.sock";

        System.out.println("CLIENT: Parsed Address: " + address);

        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket socket = context.socket(SocketType.REQ);

        address = "ipc://" + address;
        socket.connect(address);
        System.out.println("CLIENT: Connected to " + address);
        for (int i = 0; i < 1000; i++) {
            socket.send("Ping " + i);
            System.out.println("CLIENT: Sent.");
            String rep = new String(socket.recv());
            System.out.println("Reply " + i + ": " + rep);
        }
        socket.close();
        context.term();
    }
}

The issue is that this will print CLIENT: Sent., but on the other end the server never prints anything.

I have this working with a basic python client:

import zmq
from sys import argv

# Sends ping requests and waits for replies.
print("CLIENT: pinging")
context = zmq.Context()
sock = context.socket(zmq.REQ)
print("CLIENT: Binding to ipc://" + argv[-1])
sock.bind("ipc://" + argv[-1])
print('bound')
for i in range(1000):
    # REQ payloads must be bytes
    sock.send(('ping %s' % i).encode())
    rep = sock.recv()  # This blocks until we get something
    print('Ping got reply:', rep)

To my eye this appears to do the same as the Java client (though since it doesn't work, I presume I am missing something).

Any help would be greatly appreciated!

Comments:

What are the exact versions of the ZeroMQ used on the python-side and the java-side? – user3666197
@user3666197 They are versions python: 4.3.1 and java: 4.1.7, so I guess that might be causing issues? – Cjen1
Could you also post the uname -a @Cjen1? Thanks. – user3666197
@user3666197 Linux 1e1b6a3d14de 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux. Additionally, every time the socket gets bound it gets mode 755, which may be causing issues, since the java process may be in another namespace (a different container) and thus unable to write to it. However, the python code in the same container was functional. – Cjen1

1 Answer


THE ipc:// TRANSPORT-CLASS HAS A FEW PECULIARITIES :

There are a few additional conditions for ipc:// transport-class to work:

  • system has to provide UNIX domain sockets
  • ipc:// transport-class AccessPoint-address specifier must be < 107 bytes long
  • ipc:// transport-class AccessPoint-address specifier must conform to valid O/S filesystem path syntax-rules ( absolute paths and @-abstract-namespace endpoints are also subject to other constraints )
  • ipc:// transport-class AccessPoint-address-endpoint must have been already created within the operating system namespace by assigning it to a socket with a successful .bind()
  • ipc:// transport-class AccessPoint-address-endpoint must be write-able for the process owning the calling Context()-instance ( containers are suspect for both resources-isolation and resource-proxy/abstraction-provisioning that may not permit access to the actual hosting O/S resources as if these were normally used by similar processes running-as-usual "outside" the horizon-of-abstraction of the "isolating" container )
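
Most of the conditions above can be pre-flight checked from plain Python before any .connect() attempt. A minimal sketch ( the helper name check_ipc_endpoint and the path are illustrative placeholders, not part of any API ):

```python
import os
import stat

def check_ipc_endpoint(path):
    """Best-effort pre-flight checks for an ipc:// endpoint path (illustrative helper)."""
    issues = []
    # the address specifier must fit into sun_path ( ~107 bytes on Linux )
    if len(path.encode()) >= 107:
        issues.append("path too long for a UNIX domain socket")
    if not os.path.exists(path):
        issues.append("endpoint not created yet -- has .bind() succeeded?")
    else:
        st = os.stat(path)
        if not stat.S_ISSOCK(st.st_mode):
            issues.append("path exists but is not a socket")
        if not os.access(path, os.W_OK):
            issues.append("calling process cannot write to the socket")
    return issues

print(check_ipc_endpoint("/path/to/named/socket.sock"))
```

Running this inside the Java client's container ( not the server's one ) is what matters here, since it is the connecting process whose view of the filesystem and whose permissions decide.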

THE NEXT STEP : isolate a root-cause to be able to move forwards

Even though your code above uses blocking-mode .recv()-s and does not explicitly test for any potential error-states, the next step in isolating the root-cause is to set up the same REQ/ROUTER pair using the tcp:// transport-class, so as to disambiguate whether the collision relates to the choice of the transport-class or has some other root-cause.
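
Such an isolation test can be sketched in a single pyzmq process ( the O/S picks the port here; for the cross-language test, the Java client would connect to the same fixed tcp:// endpoint instead ):

```python
import zmq

ctx = zmq.Context()

# ROUTER side, as in the workload generator
router = ctx.socket(zmq.ROUTER)
port = router.bind_to_random_port("tcp://127.0.0.1")

# REQ side, standing in for the Java client
req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:%d" % port)

req.send(b"Ping 0")
# ROUTER prepends the sender identity: [ identity, b'', payload ]
addr, empty, payload = router.recv_multipart()
router.send_multipart([addr, b"", b"Pong 0"])
reply = req.recv()
print(reply)  # b'Pong 0'

for s in (req, router):
    s.setsockopt(zmq.LINGER, 0)
    s.close()
ctx.term()
```

If this round-trip works over tcp:// between the two containers but not over ipc://, the transport-class ( i.e. the socket-file visibility / permissions ) is the prime suspect.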

There also ought to be fair resources-management steps included, so as to gracefully release resources well before the process exits ( not speaking about the impacts of forcefully terminated processes, which leave resources hanging and disallow their re-use upon re-running the intended code ):

  • explicit aSocketINSTANCE.setsockopt( ZMQ_LINGER, 0 )

  • explicit aSocketINSTANCE.unbind() and/or aSocketINSTANCE.close()

  • explicit aContextINSTANCE.term()
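
The shutdown sequence above, sketched in pyzmq terms ( the endpoint is a placeholder ):

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
# LINGER == 0: never block ctx.term() on still-undelivered messages
sock.setsockopt(zmq.LINGER, 0)

endpoint = "tcp://127.0.0.1:5555"   # placeholder endpoint
sock.connect(endpoint)
try:
    pass  # ... the actual send()/recv() work would go here ...
finally:
    sock.disconnect(endpoint)       # for a bound socket, use .unbind()
    sock.close()
    ctx.term()                      # returns promptly thanks to LINGER == 0
```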

and some additional preventive steps and performance- or security-tweaks may still be needed, using ZMQ_IMMEDIATE, or {UID|PID|GID}-based access-filtering of the processes that may and can .bind()/.connect() ( now deprecated in favour of the ( originally "for tcp"-only ) ZAP-API services ), to name just a few further directions of design polishing.

Epilogue :

Different wrappers may not only use different ZeroMQ versions, but also have different resources-handling strategies. Thus the python one may work while the java one need not. As an example, .getsockopt( ZMQ_USE_FD ) may show different outputs, based on the wrapper authors' design decisions.
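
As a quick check of the python side, both the native library version and the wrapper's own version can be printed ( the java binding exposes a comparable version query ):

```python
import zmq

print("native libzmq :", zmq.zmq_version())    # version of the wrapped C library
print("pyzmq wrapper :", zmq.pyzmq_version())  # version of the binding itself
major, minor, patch = zmq.zmq_version_info()
print("major.minor   : %d.%d" % (major, minor))
```

Comparing these outputs across both sides disambiguates whether "4.3.1 vs 4.1.7" refers to the native library or to the binding layer.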


In case someone would like to read more about ZeroMQ internals here, or perhaps just to get a general view from the orbit-high perspective as in "ZeroMQ Hierarchy in less than a Five Seconds", feel free to click through and gather further details from the pieces of experience of the ease of life in accord with the rules of the Zen-of-Zero.