In a local (test) setup with two nodes on the same machine (static IP configuration with port range 47500..47501), the 'second' node won't join the cluster. It issues a TcpDiscoveryJoinRequestMessage that seems to be answered by the 'first' node, yet once the network timeout (20 s) expires, it logs a "Node has not been connected to topology" message and keeps sending discovery join requests, which the first node then ignores ("Ignoring join request message since node is already in topology").
The same applies to a ('real') cluster setup on Docker machines (both bare metal and VM).
Is this a known issue? Any advice on where or what to look for? Ignite produces tons of logs (TcpDiscoverySpi), but I can't see any error or warning that might explain the behaviour. Static IP configuration and the customized network timeout are in effect.
Configuration is given as YAML to build a configuration bean (Spring Boot application), which in turn constructs the actual Ignite config:
grid:
  discovery:
    network-timeout: 20000
    join-timeout: 20000
    static:
      enabled: true
      addresses: 127.0.0.1:47500..47501
TcpDiscoveryVmIpFinder is in effect (as seen in the logs).
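To make explicit what the address entry covers, here is a minimal sketch that expands a `host:first..last` string into individual endpoints. This is not Ignite's own parser; the class and method names are hypothetical and it handles only the simple form used in the config above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Ignite: expands "host:first..last"
// into one "host:port" entry per port in the range.
final class DiscoveryAddressRange {
    static List<String> expand(String spec) {
        int colon = spec.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port or host:first..last, got " + spec);
        }
        String host = spec.substring(0, colon);
        String ports = spec.substring(colon + 1);

        // A single port has no ".." separator; treat it as a range of one.
        int dots = ports.indexOf("..");
        int first = Integer.parseInt(dots < 0 ? ports : ports.substring(0, dots));
        int last = dots < 0 ? first : Integer.parseInt(ports.substring(dots + 2));

        List<String> endpoints = new ArrayList<>();
        for (int port = first; port <= last; port++) {
            endpoints.add(host + ":" + port);
        }
        return endpoints;
    }
}
```

For `127.0.0.1:47500..47501` this yields two endpoints, one per local node, so each node should have exactly one peer address to probe besides its own.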
See also the relevant sections from the node logs (TcpDiscoverySpi).
"Restored topology from node added message". After that, TcpDiscoveryNodeAddedMessage should be sent back to the coordinator, but there is no "Message has been sent to next node" record for it in the log. This means that the tcp-disco-msg-worker thread is stuck somewhere in the middle. Please take a thread dump of the second node after you see the message about "restored topology", and let me take a look. – Denis
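One way to capture the requested dump from inside the second node's JVM is the standard `java.lang.management` API. The sketch below is only an illustration: the class name is hypothetical, and it assumes you can trigger it from within the Spring Boot application once the "restored topology" message has appeared. Running `jstack <pid>` against the node's process gives a full dump without any code changes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: dumps the stack of every live thread in this JVM
// whose name contains the given fragment, e.g. "tcp-disco-msg-worker".
final class DiscoveryThreadDump {
    static String dump(String nameFragment) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();

        // Request lock details only where the JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());

        for (ThreadInfo info : infos) {
            if (!info.getThreadName().contains(nameFragment)) {
                continue;
            }
            out.append('"').append(info.getThreadName()).append('"')
               .append(" state=").append(info.getThreadState());
            if (info.getLockName() != null) {
                // The object this thread is blocked or waiting on, if any.
                out.append(" waiting on ").append(info.getLockName());
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }
}
```

Calling `DiscoveryThreadDump.dump("tcp-disco-msg-worker")` and logging the result should show where the message worker is blocked, provided the thread name on your Ignite version contains that fragment.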