2
votes

I am building a simple gen_server module which monitors the activity of multiple remote nodes.

When a remote node registers, this module monitors the node with erlang:monitor_node(Node, true). This happens only once per node (confirmed with logs).

In the gen_server's handle_info/2 callback, it catches the {nodedown, Node} message and stops monitoring the node with erlang:monitor_node(Node, false). I expect to receive this message only once: when the remote node goes down.

When I was testing the module, I found that when a remote node goes down, many {nodedown, Node} messages (the number varies from a few hundred to a few thousand) are sent to the gen_server.

Why are multiple messages sent by monitor_node? How can I prevent this behaviour?

EDIT: here is (a part of) the source code:

register_node(#node_info{node = NodeName} = NodeInfo) ->
    case mnesia:read(node_info, NodeName) of
        [] ->
            monitor_node(NodeName, true),
            error_logger:info_msg("node ~p registered", [NodeName]);
        [_OldInfo] ->
            error_logger:trace_msg("info of node ~p updated", [NodeName])
    end,
    mnesia:write(NodeInfo).

handle_cast({register_node, #node_info{} = NodeStatus}, Timer) ->
    case mnesia:transaction(fun register_node/1, [NodeStatus]) of
        {aborted, Reason} ->
            error_logger:warning_msg("transaction register_node failed: ~p", [Reason]);
        _ ->
            ok
    end,
    {noreply, Timer};
handle_cast({shutdown_node, #node_info{} = NodeStatus}, Timer) ->
    case mnesia:dirty_delete_object(NodeStatus) of
        {aborted, Reason} ->
            error_logger:warning_msg("transaction shutdown_node failed: ~p", [Reason]);
        _ ->
            ok
    end,
    {noreply, Timer};
handle_cast(Message, Timer) ->
    error_logger:warning_msg("~p: received unknown message ~p", [?MODULE, Message]),
    {noreply, Timer}.

handle_info({nodedown, Node}, Timer) ->
    monitor_node(Node, false),
    error_logger:info_msg("~p: node ~p down", [?MODULE, Node]),
    mnesia:transaction(fun mnesia:delete/3, [node_info, Node, write]),
    {noreply, Timer};
handle_info(Message, Timer) ->
    error_logger:warning_msg("~p: received unknown message ~p", [?MODULE, Message]),
    {noreply, Timer}.
1
Please post your source code here. - Chen Yu

1 Answer

5
votes

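You call monitor_node(NodeName, true) **inside** the Mnesia transaction.

I think this is the problem. monitor_node is a side effect: it involves message passing (an I/O operation) internally, so it does not belong inside a transaction. Mnesia may run the transaction function more than once, for example when it restarts the transaction after a lock conflict, and every run calls monitor_node(NodeName, true) again. The node may therefore have been monitored hundreds of times, so that when it went down, hundreds of nodedown messages were delivered. The documentation for monitor_node says: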

> If a process has made two calls to monitor_node(Node, true) and Node terminates, **two nodedown messages are delivered to the process.** If there is no connection to Node, there will be an attempt to create one. If this fails, a nodedown message is delivered.
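Because monitors stack in this way, one defensive measure is to remember which nodes are already monitored and skip the call for those. The following is only a minimal sketch under assumptions: the original gen_server state is just `Timer`, so the set of monitored nodes, the module name, and both function names are hypothetical.

```erlang
-module(monitor_guard_sketch).
-export([ensure_monitored/2, forget/2]).

%% Monitored is a set (from the sets module) holding the names of the nodes
%% this process already monitors. It is hypothetical state that the
%% gen_server in the original post does not have.

%% Call erlang:monitor_node/2 only for nodes that are not in the set yet,
%% so that a repeated registration never stacks a second monitor.
ensure_monitored(Node, Monitored) ->
    case sets:is_element(Node, Monitored) of
        true ->
            Monitored;
        false ->
            true = erlang:monitor_node(Node, true),
            sets:add_element(Node, Monitored)
    end.

%% Intended for the {nodedown, Node} clause of handle_info/2, so that the
%% node can be monitored again if it registers later.
forget(Node, Monitored) ->
    sets:del_element(Node, Monitored).
```

Such a guard only helps if ensure_monitored/2 is itself called outside any Mnesia transaction; inside one, the caller's state update could still be skipped or repeated.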

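Please move that call out of the transaction, for example by letting the transaction return a result and acting on it in a case expression afterwards, and try again. This is the code in question, with monitor_node called inside the function that is passed to mnesia:transaction: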

register_node(#node_info{node = NodeName} = NodeInfo) ->
    case mnesia:read(node_info, NodeName) of
        [] ->
            monitor_node(NodeName, true),
            error_logger:info_msg("node ~p registered", [NodeName]);
        [_OldInfo] ->
            error_logger:trace_msg("info of node ~p updated", [NodeName])
    end,
    mnesia:write(NodeInfo).
handle_cast({register_node, #node_info{} = NodeStatus}, Timer) ->
    case mnesia:transaction(fun register_node/1, [NodeStatus]) of
        {aborted, Reason} ->
            error_logger:warning_msg("transaction register_node failed: ~p", [Reason]);
        _ ->
            ok
    end,
    {noreply, Timer};
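One way to restructure the code above so that the side effect runs outside the transaction is sketched below. It is a sketch under assumptions, not a drop-in replacement: only the `node` field of `#node_info{}` appears in the post, so the `info` field is a placeholder, and the module name and the helpers store_node/1 and monitor_if_new/2 are hypothetical. error_logger:info_msg is used for the "updated" log line, since I could not confirm that error_logger exports trace_msg.

```erlang
-module(node_registry_sketch).
-export([register_node/1, store_node/1, monitor_if_new/2]).

%% Only the `node` field appears in the post; `info` is a placeholder.
-record(node_info, {node, info}).

%% Runs inside the transaction: database operations only, no side effects.
%% Returns `new` or `updated` so the caller knows what happened.
store_node(#node_info{node = NodeName} = NodeInfo) ->
    Result = case mnesia:read(node_info, NodeName) of
                 []         -> new;
                 [_OldInfo] -> updated
             end,
    ok = mnesia:write(NodeInfo),
    Result.

%% Runs after the transaction has returned, so it executes once per call
%% to register_node/1, however often Mnesia re-ran store_node/1.
monitor_if_new(new, NodeName) ->
    true = erlang:monitor_node(NodeName, true),
    error_logger:info_msg("node ~p registered~n", [NodeName]);
monitor_if_new(updated, NodeName) ->
    error_logger:info_msg("info of node ~p updated~n", [NodeName]).

%% Entry point that the handle_cast clause for register_node could call.
register_node(#node_info{node = NodeName} = NodeInfo) ->
    case mnesia:transaction(fun store_node/1, [NodeInfo]) of
        {atomic, Result} ->
            monitor_if_new(Result, NodeName);
        {aborted, Reason} ->
            error_logger:warning_msg("transaction register_node failed: ~p~n",
                                     [Reason])
    end.
```

With this split, handle_cast would call register_node(NodeStatus) directly instead of wrapping it in mnesia:transaction itself. The design choice is that the transaction only reports what it found, and the decision to monitor is made once, from its result.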

Explanation of side effects in Mnesia transactions:

Mnesia dynamically sets and releases locks as transactions execute. It is therefore very dangerous to execute code with side effects inside a transaction. In particular, a receive statement inside a transaction can lead to a situation where the transaction hangs and never returns, which in turn can prevent locks from being released. This situation could bring the whole system to a standstill, since other transactions which execute in other processes, or on other nodes, are forced to wait for the defective transaction.