In a local test setup with two nodes on the same machine (static IP configuration, port range 47500..47501), the second node won't join the cluster. It sends a TcpDiscoveryJoinRequestMessage that seems to be answered by the first node, yet once the network timeout (20 s) expires, it logs "Node has not been connected to topology" and keeps sending discovery join requests, which the first node then ignores ("Ignoring join request message since node is already in topology").

The same happens in a real cluster setup on Docker machines (both bare metal and VM).

Is this a known issue? Any advice on where / what to look for? Ignite issues tons of logs (TcpDiscoverySpi), but I can't see any error or warning that might explain the behaviour. Static IP configuration and customized network timeout are in effect.

The configuration is given as YAML and bound to a configuration bean (Spring Boot application), which in turn constructs the actual Ignite configuration.

grid:
  discovery:
    network-timeout: 20000
    join-timeout: 20000
    static:
      enabled: true
      addresses: 127.0.0.1:47500..47501

TcpDiscoveryVmIpFinder is in effect (as seen in the logs).

See also the relevant sections from the node logs (TcpDiscoverySpi).

Could you share the configuration? - Denis
@Denis - Sure. I've edited the original post accordingly. - ngc3370
Logs would also be helpful. Please upload the full logs somewhere like pastebin.com. Do you also observe this problem if the nodes are started on different machines? - Denis
@Denis - Relevant logs from both nodes can be found here. This also happens when starting on different machines (Docker containers). - ngc3370
The log of Node 2 contains the message "Restored topology from node added message". After that, a TcpDiscoveryNodeAddedMessage should be sent back to the coordinator, but the log has no "Message has been sent to next node" record for it. This means that the tcp-disco-msg-worker thread is stuck somewhere in between. Please take a thread dump of the second node after you see the "Restored topology" message, and let me take a look. - Denis
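A thread dump like the one requested above is usually taken with the JDK's jstack tool against the node's process ID. Where attaching from outside is awkward (for example inside a container), the standard java.lang.management API can produce similar output from within the JVM. The following is only a minimal sketch under that assumption: the class name ThreadDumpSketch and the dump method are hypothetical, and the name prefix is taken from the thread name mentioned in the comment above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Ignite.
public class ThreadDumpSketch {
    /** Returns the stack traces of all live threads whose name starts with the given prefix. */
    public static String dump(String namePrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();

        // Request lock details only if this JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(),
            mx.isSynchronizerUsageSupported());

        for (ThreadInfo info : infos) {
            if (!info.getThreadName().startsWith(namePrefix))
                continue;

            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');

            // The lock the thread is blocked or waiting on, if any.
            if (info.getLockName() != null)
                out.append("    waiting on ").append(info.getLockName()).append('\n');

            for (StackTraceElement frame : info.getStackTrace())
                out.append("    at ").append(frame).append('\n');
        }

        return out.toString();
    }

    public static void main(String[] args) {
        // Prints nothing unless a thread with this name prefix exists in this JVM.
        System.out.print(dump("tcp-disco-msg-worker"));
    }
}
```

This only sees threads of the JVM it runs in, so it would have to be called from inside the second node's process, for example from a scheduled task or a debug endpoint.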

1 Answer

As far as I can see, you use Ignite messaging, and some of your remoteListeners contain an IgniteSemaphore as a field or as part of their closure. Information about these listeners is sent to all nodes in discovery messages when they connect.

When a remoteListener is deserialised, a semaphore is requested from the DataStructuresProcessor. But it hasn't been initialised yet, since the node join hasn't finished. This is a deadlock, because a node cannot join until the DataStructuresProcessor is initialised and vice versa.

You can avoid this problem by initialising the semaphore lazily:

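// Imports needed by the enclosing source file.
import java.util.UUID;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteSemaphore;
import org.apache.ignite.lang.IgniteBiPredicate;
import org.apache.ignite.resources.IgniteInstanceResource;

// Declared as a nested class of the enclosing class.
public static class ListenerHandler implements IgniteBiPredicate<UUID, Object> {
    // Injected by Ignite on the node where the listener is deployed.
    @IgniteInstanceResource
    private Ignite ignite;

    // transient: the semaphore is not serialised together with the listener.
    private transient IgniteSemaphore sem;

    // Obtains the semaphore on first use instead of during deserialisation.
    private IgniteSemaphore semaphore() {
        if (sem != null)
            return sem;

        sem = ignite.semaphore("sem", 1, true, true);
        return sem;
    }

    @Override public boolean apply(UUID uuid, Object o) {
        // Call semaphore() here rather than reading the field directly.
        // ...
    }
}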
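
To see why a transient field with a lazy getter helps, here is a standalone sketch that uses only the JDK. It is an illustration, not Ignite code: java.util.concurrent.Semaphore stands in for IgniteSemaphore, plain Java serialisation stands in for Ignite's marshalling (assumed to treat transient fields the same way), and the names LazyFieldSketch, Handler and roundTrip are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.Semaphore;

public class LazyFieldSketch {
    /** Stand-in for the listener: serialisable, with a lazily created resource. */
    public static class Handler implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: not written to the stream, so it is null after deserialisation.
        private transient Semaphore sem;

        public boolean isInitialised() {
            return sem != null;
        }

        public Semaphore semaphore() {
            if (sem == null)
                sem = new Semaphore(1, true); // Stand-in for ignite.semaphore(...).

            return sem;
        }
    }

    /** Serialises and deserialises the handler, roughly as when it is shipped to another node. */
    public static Handler roundTrip(Handler h) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();

            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(h);
            }

            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Handler) in.readObject();
            }
        }
        catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Handler copy = roundTrip(new Handler());

        System.out.println(copy.isInitialised()); // false: deserialisation created nothing
        copy.semaphore();
        System.out.println(copy.isInitialised()); // true: created on first use
    }
}
```

Deserialising the handler never touches the semaphore, so nothing is requested until the listener is actually invoked. The getter is not synchronised; if the listener can be invoked concurrently, guarding the initialisation may be needed.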

Related issue on the bug tracker: IGNITE-3089