Why have a supervision tree instead of just one centralised supervisor?

Question

I'm learning Elixir and Erlang/OTP and would like to understand the significance of having a supervision tree when building a highly available system.

I can see the importance of a supervisor in managing the lifetime of worker processes. But I would still like to know why some applications need to organise supervisors into hierarchies instead of having a single supervisor manage all the workers. Are there practical benefits to such structures that I have naively overlooked?

Borrowing an example from the Programming Elixir book, in which scenario do we prefer the first structure over the second structure?

1.  MainSupervisor
    ├── StashWorker
    └── SubSupervisor
        └── SequenceWorker

2.  MainSupervisor
    ├── StashWorker
    └── SequenceWorker

Aleksei Matiushkin · Accepted Answer · 2020-08-13T04:38:13

What you probably overlook is the famous “let it crash” philosophy, which makes process crashes and restarts first-class citizens in OTP. We don’t treat process crashes as failures, but rather as an opportunity to redo the work properly without having to handle errors manually.

The main reason is to allow finer-grained control over what gets restarted on failure. For that, we have supervision strategies. Or, as @Andree restated it in the comments:

by organizing supervisions in hierarchies, we allow finer-grained control over how the system should respond should a subset of the system fail

Imagine an application that has a process responsible for a remote connection and a bunch of processes that all use this resource. When the connection process crashes, its supervisor restarts it in any case, but its pid changes, meaning all the processes that relied on this pid must be restarted as well. With the :rest_for_one strategy, which restarts the crashed child together with every child started after it, this works out of the box.
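The connection scenario can be sketched roughly as follows. All module names here (Demo.Connection, Demo.Client, Demo.ConnSupervisor) are made up for illustration, and plain Agents stand in for a real connection and its users; this is a minimal sketch, not a production layout.

```elixir
defmodule Demo.Connection do
  # Hypothetical stand-in for the process that owns a remote connection.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> :connected end, name: __MODULE__)
end

defmodule Demo.Client do
  # Hypothetical worker that remembers the connection pid it saw at startup.
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> Process.whereis(Demo.Connection) end, name: __MODULE__)
  end

  # Returns the connection pid this client captured when it started.
  def conn_pid, do: Agent.get(__MODULE__, & &1)
end

defmodule Demo.ConnSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Order matters: the connection starts first, the client after it.
    # With :rest_for_one, a crash of the connection also restarts the client,
    # while a crash of the client leaves the connection alone.
    children = [Demo.Connection, Demo.Client]
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```

After Demo.ConnSupervisor.start_link([]), killing the connection with Process.exit(Process.whereis(Demo.Connection), :kill) should leave both children running under new pids, with Demo.Client.conn_pid() pointing at the new connection process.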

Another approach to this particular example would be to manage the connection in a process supervised in another part of the tree and, upon connection issues, manually crash the supervisor of the pools using this connection to reinitialize all of them.

Even more, we might want to manually crash the process handling this connection to reinitialize it. Instead of writing defensive code like if no_conn, do: reload_config_and_restart_connection, we just let it crash and get reinitialized by the supervision tree with the new, proper config.
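A rough illustration of crashing on purpose instead of recovering in place. The names LetItCrash.ConfigStore and LetItCrash.Conn, the connection_lost/0 function, and the host values are all hypothetical; the only assumption is that the config is re-read in init/1, so every restart picks up whatever is current.

```elixir
defmodule LetItCrash.ConfigStore do
  # Hypothetical holder of the current connection config.
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> %{host: "old.example"} end, name: __MODULE__)
  end

  def get, do: Agent.get(__MODULE__, & &1)
  def put(config), do: Agent.update(__MODULE__, fn _ -> config end)
end

defmodule LetItCrash.Conn do
  # Hypothetical connection process.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def host, do: GenServer.call(__MODULE__, :host)
  def connection_lost, do: GenServer.cast(__MODULE__, :connection_lost)

  @impl true
  def init(:ok) do
    # The config is read on every (re)start, so a restart is a full reinit.
    {:ok, LetItCrash.ConfigStore.get()}
  end

  @impl true
  def handle_call(:host, _from, config), do: {:reply, config.host, config}

  @impl true
  def handle_cast(:connection_lost, config) do
    # No in-place recovery: stop abnormally and let the supervisor restart us.
    {:stop, :connection_lost, config}
  end
end
```

Started with Supervisor.start_link([LetItCrash.ConfigStore, LetItCrash.Conn], strategy: :one_for_one), a call to LetItCrash.Conn.connection_lost() terminates the connection process, and the supervisor brings up a fresh one that reads the config again. The worker itself contains no reconnect logic at all.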

Last but not least, a supervisor that cannot keep its children alive (more than max_restarts restarts within max_seconds) gives up and exits itself, propagating the failure up to its own supervisor. That way we might reinitialize a whole branch of the supervision tree without writing a line of code.
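One concrete way a failure travels up the tree is restart intensity: a supervisor whose children are restarted more than max_restarts times within max_seconds shuts itself down, and its own parent then restarts the whole branch. The sketch below uses hypothetical Escalation.* modules and a deliberately low limit of one restart per five seconds so the effect is easy to trigger.

```elixir
defmodule Escalation.Worker do
  # Hypothetical worker; an Agent is enough to have something to kill.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule Escalation.SubSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Tolerate at most 1 restart within 5 seconds. A second crash inside
    # that window makes this supervisor give up and exit.
    Supervisor.init([Escalation.Worker],
      strategy: :one_for_one,
      max_restarts: 1,
      max_seconds: 5
    )
  end
end

defmodule Escalation.MainSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # The sub-supervisor is an ordinary permanent child here, so when it
    # exits, the whole branch below it is started again from scratch.
    Supervisor.init([Escalation.SubSupervisor], strategy: :one_for_one)
  end
end
```

Killing Escalation.Worker once only replaces the worker. Killing it a second time within five seconds should take Escalation.SubSupervisor down as well, after which Escalation.MainSupervisor starts a new sub-supervisor with a new worker.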
