How does detection of terminated nodes work in Erlang? How does net_ticktime influence node liveness checks?

Question

I set the net_ticktime value to 600 seconds:

net_kernel:set_net_ticktime(600)

The Erlang documentation for net_ticktime = TickTime says:

Specifies the net_kernel tick time. TickTime is given in seconds. Once every TickTime/4 second, all connected nodes are ticked (if anything else has been written to a node) and if nothing has been received from another node within the last four (4) tick times that node is considered to be down. This ensures that nodes which are not responding, for reasons such as hardware errors, are considered to be down.

The time T, in which a node that is not responding is detected:

MinT < T < MaxT where:

MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4

TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds.

Note: Normally, a terminating node is detected immediately.

My problem: my TickTime is 600 seconds, so 450 seconds (7.5 minutes) < T < 750 seconds (12.5 minutes). However, even though I set net_ticktime to 600 on all distributed nodes, when a node fails (e.g. when I close its Erlang shell) the other nodes are notified immediately rather than within the window the ticktime definition predicts.

The documentation does note that a terminating node is normally detected immediately, but I could not find an explanation of how this immediate detection works in distributed Erlang (neither in the Erlang documentation, nor in Erlang ebooks or other Erlang sources). Are nodes in a distributed environment pinged periodically at intervals smaller than net_ticktime, or does the terminating node send some kind of message to the other nodes before it terminates? If it does send a message, are there scenarios in which a terminating node cannot send it and must instead be pinged to check whether it is still alive?

The Erlang documentation also notes that Distributed Erlang does not scale well to clusters larger than about 100 nodes, because every node keeps links to all other nodes in the cluster. Does the algorithm for checking node liveness (pinging, announcing termination) change as the cluster grows?

One key question: do your nodes do any communication of their own between ticks (rpc's, mnesia transactions, etc.)? If so, it would be possible (and indeed, highly likely) for the VM to detect a downed node prior to the tick. — Soup d'Campbells
I tested this in two situations: 1.) I connected Erlang shells on two VMs using only net_adm:ping(), with no other communication between ticks. After I closed one shell, nodes() on the other immediately recognized that the node was down and returned []. 2.) I tried this in my Riak app: I killed the node that ran the app and then periodically asked the other nodes what they knew about the killed node. After timer:sleep(10) the live nodes still listed the killed node as alive, but after timer:sleep(100) it was marked as down, and 100 ms is far less than 600 s. — Zuzana

Joe Joe · Accepted Answer · 2014-06-22T06:33:12

When two Erlang nodes connect, a TCP connection is made between them. The failure you are inducing would cause the underlying OS to close the connection, effectively notifying the other node very quickly.

The network tick is used to detect a connection to a distant node that appears to be up but is not actually passing traffic, such as may occur when a network event isolates a node.

If you want to simulate a failure that would require a tick to detect, use a firewall to block the traffic on the connection created when the nodes first ping.
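To see which of the two mechanisms detected a failure, one option is to subscribe to node status messages and print the reason net_kernel reports. The following is a minimal sketch, not a definitive implementation: the module name nodedown_watch and the helper reason/1 are made up for illustration, and it assumes the node was started as a distributed node (with -sname or -name). It uses net_kernel:monitor_nodes/2 with the nodedown_reason option.

```erlang
-module(nodedown_watch).
-export([watch/0, reason/1]).

%% Subscribe the calling process to node status messages and ask
%% net_kernel to include the reason with every nodedown message.
%% Assumption: this runs on a distributed node (-sname or -name).
watch() ->
    ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
    loop().

%% Print every nodeup/nodedown message as it arrives.
loop() ->
    receive
        {nodeup, Node, _Info} ->
            io:format("~p is up~n", [Node]),
            loop();
        {nodedown, Node, _Info} = Msg ->
            io:format("~p is down, reason: ~p~n", [Node, reason(Msg)]),
            loop()
    end.

%% Hypothetical helper: pull the reason out of a nodedown message
%% delivered in the option-list format ({nodedown, Node, InfoList}).
reason({nodedown, _Node, Info}) ->
    proplists:get_value(nodedown_reason, Info).
```

Calling nodedown_watch:watch() in a shell on one node and then closing the shell of a connected node should typically print a reason such as connection_closed almost at once, since the OS closed the TCP connection. If the traffic is blocked with a firewall instead, the expected reason is net_tick_timeout, reported only after the tick window has elapsed.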
