1
votes

I just started learning Erlang and this is a module from a test project of mine. I'm doing it so that I can understand a little better how the supervision tree works, to practice fail-fast code and some programming best practices.

The udp_listener process listens for UDP messages. It's role is to listen to communication requests from other hosts in the network and contact them through TCP using the port number defined in the UDP message.

The handle_info(...) function is called every time an UDP message is received by the socket, it decodes the UDP message and passes it to the tcp_client process.

From what I understood the only failure point in my code is the decode_udp_message(Data) called sometime inside handle_info(...).

When this functions fails, is the whole udp_listener process is restarted? Should I keep this from happening?

Shouldn't just the handle_info(...) function silently die without affecting the udp_listener process?

How should I log an exception on decode_udp_message(Data)? I would like to register somewhere the host and it's failed message.

-module(udp_listener).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2, 
         handle_info/2, terminate/2, code_change/3]).

%% ====================================================================
%% API functions
%% ====================================================================

-export([start_link/1]).

start_link(Port) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Port, []).

%% ====================================================================
%% Behavioural functions 
%% ====================================================================

%% init/1
%% ====================================================================
-spec init(Port :: non_neg_integer()) -> Result when
    Result :: {ok, Socket :: port()}
            | {stop, Reason :: term()}.
%% ====================================================================
init(Port) ->
    SocketTuple = gen_udp:open(Port, [binary, {active, true}]),
    case SocketTuple of
        {ok, Socket}        -> {ok, Socket};
        {error, eaddrinuse} -> {stop, udp_port_in_use};
        {error, Reason}     -> {stop, Reason}
    end.

% Handles "!" messages from the socket
handle_info({udp, Socket, Host, _Port, Data}, State) -> Socket = State,
    handle_ping(Host, Data),
    {noreply, Socket}.

terminate(_Reason, State) -> Socket = State,
    gen_udp:close(Socket).

handle_cast(_Request, State)        -> {noreply, State}.
handle_call(_Request, _From, State) -> {noreply, State}.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% ====================================================================
%% Internal functions
%% ====================================================================

handle_ping(Host, Data) ->
    PortNumber = decode_udp_message(Data),
    contact_host(Host, PortNumber).

decode_udp_message(Data) when is_binary(Data) ->
    % First 16 bits == Port number
    <<PortNumber:16>> = Data,
    PortNumber.

contact_host(Host, PortNumber) ->
    tcp_client:connect(Host, PortNumber).

Result

I've changed my code based on your answers, decode_udp_message is gone because handle_ping does what I need.

handle_ping(Host, <<PortNumber:16>>) ->
    contact_host(Host, PortNumber);

handle_ping(Host, Data) ->
    %% Here I'll log the invalid datagrams but the process won't be restarted

I like the way it is now, by adding the following code I could handle protocol changes in the future without losing backwards compatibility with old servers:

handle_ping(Host, <<PortNumber:16, Foo:8, Bar:32>>) ->
    contact_host(Host, PortNumber, Foo, Bar);
handle_ping(Host, <<PortNumber:16>>) ->
    ...

@Samuel-Rivas

tcp_client is another gen_server with it's own supervisor, it will handle its own failures.

-> Socket = State in now only present in the terminate function. gen_udp:close(Socket). is easier on the eyes.

2

2 Answers

3
votes

Your decode_message is not the only point of failure. contact_host can most likely fail too, but you are either ignoring the error tuple or handling that failure in your tcp_client implementation.

That aside, you approach to error handling would work provided that your udp_listener is started by a supervisor with the correct strategy. If Data is not exactly 16 bits then the matching will fail and the process will crash with a badmatch exception. Then the supervisor will start a new one.

Many online style guides will advertise just that style. I think they are wrong. Even though failing there right away is just what you want, it doesn't mean you cannot provide a better reason than badmatch. So I would write some better error handling there. Usually, I would throw an informative tuple, but for gen servers that is tricky because they wrap every call within a catch which will turn throws into valid values. That is unfortunate, but is a topic for other long explanation, so for practical purposes I will throw errors here. A third alternative is just to use error tuples ({ok, Blah} | {error, Reason}), however that gets complicated fast. Which option to use is also topic for a long explanation/debate, so for now I'll just continue with my own approach.

Getting back to your code, if you want proper and informative error management, I would do something in this lines with the decode_udp_message function (preserving your current semantics, see at the end of this response, since I think they are not what you wanted):

decode_udp_message(<<PortNumber:16>>) ->
    PortNumber;
decode_udp_message(Ohter) ->
    %% You could log here if you want or live with the crash message if that is good enough for you
    erlang:error({invalid_udp_message, {length, byte_size(Other)}}).

As you have said, this will take the entire UDP connection with it. If the process is restarted by a supervisor, then it will reconnect (which will probably cause problems unless you use the reuseaddr sockopt). That will be fine unless you are planning to fail many times per second and opening the connection becomes a burden. If that is the case you have several options.

  • Assume that you can control all your points of failure and handle errors there without crashing. For example, in this scenario you could just ignore malformed messages. This is probably fine in simple scenarios like this, but is unsafe as it is easy to overlook points of failure.
  • Separate the concerns that you want to keep fault tolerant. In this case I would have one process to hold the connections and another one to decode the messages. For the latter you could use a "decoding server" or spawn one per message depending on your preferences and the load you are expecting.

Summary:

  • Failing as soon as your code finds something outside what is the normal behaviour is a good idea, but remember to use supervisors to restore functionality
  • Just-let-it-crash, is a bad practice in my experience, you should strive for clear error reasons, that will make your life easier when your systems grow
  • Processes are your tool to isolate the scope for failure recovery, if you don't want one system to be affected by failures/restarts, just spawn off processes to handle the complexity that you want to isolate
  • Sometimes performance gets in the way and you'll need to compromise and handle errors in place instead of let processes crash, but as usual, avoid premature optimisation in this sense

Some notes about your code unrelated to error handling:

  • Your comment in the decode_udp_message seems to imply that you want to parse the first 16 bits, but you are actually forcing Data to be exactly 16 bits.
  • In some of your calls you do something like -> Socket = State, that indentation is probably bad style, and also the renaming of the variable is somewhat unnecessary. You can either just change State for Socket in the function head or, if you want to make clear that your state is a socket write your function head like ..., Socket = State) ->
4
votes

I think that "let it crash" has often been misinterpreted as "do not handle errors" (a much stronger and stranger suggestion). And the answer to your question ("should I handle errors or not") is "it depends".

One concern with error handling is the user experience. You're never going to want to throw a stack trace an supervision tree at you users. Another concern, as Samuel Rivas points out, is that debugging from just a crashed process can be painful (especially for a beginner).

Erlang's design favors servers with non-local clients. In this architecture, the clients must be able to handle the server suddenly becoming unavailable (your wifi connection drops just as you click the "post" button on S.O.), and the servers must be able to handle sudden drop-outs from clients. In this context, I would translate "let it crash" as "since all parties can handle the server vanishing and coming back, why not use that as the error handler? Instead of writing tons of lines of code to recover from all the edge-cases (and then still missing some), just drop all the connections and return to a known-good state."

The "it depends" comes in here. Maybe it's really important to you to know who sent the bad datagram (because you're also writing the clients). Maybe the clients always want a reply (hopefully not with UDP).

Personally, I begin by writing the "success path", which includes both successful success and also the errors that I want to show clients. Everything that I didn't think of or that clients don't need to know about is then handled by the process restarting.