DynamicSupervisor adheres to the same restart policy as the regular Supervisor, and it works the way it does for a good reason. Instead of trying to work around this behaviour, we need to understand why it is the way it is.
Understanding supervisor’s purpose
A supervisor monitors its children, and if an unexpected failure brings any of them down, it restarts that child with a known initial state. The key to understanding the rationale behind restart limits lies in the definition of unexpected failures.
Unexpected here does not mean something you hadn’t thought about before pushing untested code to production. It’s something that only happens in rare circumstances which are difficult to simulate during normal testing, something that’s difficult to reproduce and that does not happen very often.
Catching such failures is difficult even with the default limit of 3 restarts within 5 seconds. In fact, this limit is way too conservative for live systems. I think it’s mostly useful for catching bugs early in development. When a bug is causing a process to shut down immediately or soon after being started, it won’t take long before it reaches 3 restarts and causes its supervisor to die. At that point you should look for the bug and fix it.
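For reference, those limits live in the :max_restarts and :max_seconds options. Here is a minimal sketch (MyApp.DynSup is a made-up name) that spells out the defaults explicitly; raising them only postpones the crash, it doesn't address the cause:

```elixir
defmodule MyApp.DynSup do
  use DynamicSupervisor

  def start_link(init_arg) do
    DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    # These are the defaults written out: at most 3 restarts within 5 seconds.
    DynamicSupervisor.init(
      strategy: :one_for_one,
      max_restarts: 3,
      max_seconds: 5
    )
  end
end
```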
A different way to fail
Assuming you do test your code and are still observing processes die regularly, you're probably experiencing a different kind of failure – an expected one. I highly suggest reading Fred Hebert's article It's About the Guarantees, which covers in great detail the way supervisors should be used and the guarantees they're supposed to provide. A very brief and abridged version of it:
Supervised processes provide guarantees in their initialization phase, not a best effort. This means that when you're writing a client for a database or service, you shouldn't need a connection to be established as part of the initialization phase unless you're ready to say it will always be available no matter what happens.
If you do require a connection to the database to be established in a process's init() callback, failing to connect really does mean the process cannot function and should die. When it's restarted by the supervisor and keeps failing, that does indeed mean the whole supervision tree cannot function correctly and should die. This continues recursively until the root supervisor is reached and the whole system goes down.
Now, Elixir provides a lot of solutions to various problems like this out of the box. In a way this is really nice, but it also often makes those problems invisible, leaving newcomers unaware of their existence. For example, Ecto depends on db_connection under the hood to provide a default exponential backoff when a connection to the database cannot be established. This behaviour is described in db_connection’s docs.
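To make the idea concrete, here's a minimal sketch of the general pattern (not db_connection's actual implementation; MyApp.Conn and the connect/1 helper are made up): the GenServer keeps connecting out of init/1 and retries with exponential backoff instead of crashing into the restart limit.

```elixir
defmodule MyApp.Conn do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Don't connect here: init/1 only sets up state we can actually guarantee.
    {:ok, %{opts: opts, conn: nil, backoff: 100}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state), do: attempt_connect(state)

  @impl true
  def handle_info(:reconnect, state), do: attempt_connect(state)

  defp attempt_connect(state) do
    case connect(state.opts) do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn, backoff: 100}}

      {:error, _reason} ->
        # Expected failure: schedule a retry with exponential backoff
        # instead of crashing and burning through the restart limit.
        Process.send_after(self(), :reconnect, state.backoff)
        {:noreply, %{state | backoff: min(state.backoff * 2, 30_000)}}
    end
  end

  # Placeholder for whatever driver call actually opens the connection.
  defp connect(_opts), do: {:error, :not_implemented}
end
```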
So what should you do?
Going back to your problem: at this point it should be clear that a different approach has to be employed for a process that can fail often when it's not a bug causing the failures. You need to acknowledge that its failure is expected and handle it explicitly in your code.
Perhaps your process depends on an external service that may occasionally be unavailable. In that case, you need to use a circuit breaker. There's one written in Erlang called fuse, which is nicely described by its author in this comment on Hacker News.
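Roughly, using fuse from Elixir looks like the sketch below (the fuse name, thresholds, and call_service/1 are made up for illustration): you install a fuse once, ask it before each call, and melt it when a call fails; after enough melts it blows and calls are rejected until it heals.

```elixir
defmodule MyApp.ExternalService do
  @fuse :external_service

  # Install the fuse once at startup: tolerate 5 failures ("melts")
  # within 10 seconds, then blow and heal again after 60 seconds.
  def setup do
    :fuse.install(@fuse, {{:standard, 5, 10_000}, {:reset, 60_000}})
  end

  def request(args) do
    case :fuse.ask(@fuse, :sync) do
      :ok ->
        case call_service(args) do
          {:ok, result} ->
            {:ok, result}

          {:error, reason} ->
            # Report the failure; enough melts in a row blow the fuse.
            :fuse.melt(@fuse)
            {:error, reason}
        end

      :blown ->
        # The service has been failing; don't even try for a while.
        {:error, :service_unavailable}
    end
  end

  # Placeholder for the actual HTTP/RPC call to the external service.
  defp call_service(_args), do: {:error, :not_implemented}
end
```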
Netflix has a blog post showcasing the use of circuit breakers in their API, which receives a pounding of billions of requests on a daily basis. That's a mind-boggling scale, and it's even bigger now since that post is from 2011!
If that's still not the kind of failure you're experiencing, then perhaps you're running untrusted code that cannot be relied on? Wrap it in a try-rescue block and return errors as values instead of relying on the supervisor to magically handle them for you.
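Something along these lines (the module name is arbitrary): the raised exception becomes an ordinary return value the caller can pattern-match on.

```elixir
defmodule MyApp.Sandbox do
  # Run an untrusted zero-arity function and turn any raised exception
  # into an ordinary return value instead of letting it crash the caller.
  def safe_run(fun) when is_function(fun, 0) do
    try do
      {:ok, fun.()}
    rescue
      exception -> {:error, exception}
    end
  end
end

# MyApp.Sandbox.safe_run(fn -> String.to_integer("not a number") end)
# #=> {:error, %ArgumentError{...}}
```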
I hope this helps.
Alternatively, you can set the child's restart option to :temporary. You can then trap abnormal exits, at which point you launch it again under the supervisor. By maintaining a count in the GenServer's state, you can check at each trapped abnormal exit whether that count exceeds a threshold and ignore any further restarts accordingly. – Kevin Johnson
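A hedged sketch of that pattern (all names and the threshold are made up; it uses a monitor rather than trapping exits, which achieves the same effect, and it assumes the hypothetical MyApp.DynSup from the earlier sketch plus a child spec given as a map):

```elixir
defmodule MyApp.Watcher do
  use GenServer

  @max_restarts 5

  def start_link(child_spec), do: GenServer.start_link(__MODULE__, child_spec)

  @impl true
  def init(child_spec) do
    {:ok, %{child_spec: child_spec, restarts: 0}, {:continue, :start_child}}
  end

  @impl true
  def handle_continue(:start_child, state) do
    # Start the child as :temporary so the DynamicSupervisor never restarts
    # it on its own; we decide when (and whether) to do that ourselves.
    spec = Map.put(state.child_spec, :restart, :temporary)
    # Error handling for start_child/2 is omitted for brevity.
    {:ok, pid} = DynamicSupervisor.start_child(MyApp.DynSup, spec)
    Process.monitor(pid)
    {:noreply, state}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, reason}, state) do
    cond do
      reason == :normal ->
        {:noreply, state}

      state.restarts < @max_restarts ->
        # Abnormal exit: start the child again and bump the counter.
        {:noreply, %{state | restarts: state.restarts + 1}, {:continue, :start_child}}

      true ->
        # Threshold exceeded: give up instead of crashing the tree.
        {:noreply, state}
    end
  end
end
```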