How can a supervisor that reached_max_restart_intensity only delete the offending child?

Question

I have a one_for_one supervisor that handles similar and totally independent children.

When one child has a problem and crashes repeatedly, the supervisor reaches its restart limit and reports:

=SUPERVISOR REPORT==== 30-Mar-2011::13:10:42 ===
     Supervisor: {local,gateway_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.76.0>}, ...

It then shuts itself down, which also terminates all the innocent children that would otherwise continue to run fine.

How can I build a supervision tree out of standard Erlang supervisors that stops restarting only the one offending child and leaves the others alone?

I was thinking about having an extra supervisor with just a single child, but this seems too heavyweight to me.

Any other ways to handle this?

The extra supervisor would just push the problem further down. It'll still crash and then crash the top level supervisor. In that case, just increase your maximum restarts and maximum time values... — Adam Lindberg
@Adam: I need to stop restarting the child because it seems to prevent me from talking to the supervisor. So I really want it to stop restarting the offending child, but without terminating the rest of them. I hoped this could be achieved without writing my own supervisor. — Peer Stritzinger
Correct me if I'm wrong, but I summarize your scenario as: you have a supervisor with totally independent children, where the children can either time out when starting or crash during runtime, and you want to restart each child on its own and stop restarting a misbehaving child without affecting the other children? — Adam Lindberg
Feels like you're crashing wrong. Adam's solution solves the problem neatly, but it feels like you need to look at why you are crashing rather than crashing and waiting for manual intervention. If your supervisor is locked because it is busy trying to restart something that crashes instantly, you're crashing wrong. (Yes, there is such a thing as crashing wrong ;) I would suggest reviewing your recovery strategy, i.e. how you restart from a crash and what faulty data might lock it into a tight loop. — Mazen Harake
@Mazen: your feeling is right: I'm crashing wrong and am about to change it. But I used the opportunity to make the whole supervision setup more robust against "wrong crashing". The trouble is that while I can improve the common case, the code I'm starting with this has quite a complicated startup, and it is very important that the other children keep running even when one unexpectedly behaves very badly (terminating children without reason will produce significant monetary loss for the customer here; hardware might even be destroyed when it happens). — Peer Stritzinger

Adam Lindberg · Accepted Answer · 2011-03-30T12:36:25

I think the best solution would be to have two layers of supervision.

The top-level supervisor starts a supervisor + process pair for each gen_server you want running. It is configured with the one_for_one strategy and temporary children.

Each supervisor running under this supervisor has MaxR and MaxT values configured so that the supervisor itself crashes once its child misbehaves.

When the lower-level supervisor crashes, the top-level supervisor "just doesn't care": temporary children are never restarted, so their exits do not count toward its restart intensity.

A supervisor consumes 233 bytes when started with one child (total heap size) so memory consumption should not be an issue.
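The footprint of a supervisor can be checked on a running node. The sketch below uses erlang:process_info/2, which reports total_heap_size in words rather than bytes, and converts it with erlang:system_info(wordsize). The module name heap_probe is hypothetical and only serves the illustration.

```erlang
%% Sketch: report a process's total heap size in words and in bytes.
%% heap_probe is a hypothetical module name, not part of OTP.
-module(heap_probe).
-export([total_heap/1]).

%% Returns {Words, Bytes} for a live process.
%% Fails with badmatch if the process is no longer alive.
total_heap(Pid) when is_pid(Pid) ->
    {total_heap_size, Words} = erlang:process_info(Pid, total_heap_size),
    {Words, Words * erlang:system_info(wordsize)}.
```

For example, heap_probe:total_heap(whereis(gateway_sup)) would report the size of the supervisor registered as gateway_sup.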

The supervision tree should look like:

supervisor_top
    |
    |
    +------------------------+-----    ...
    |                        |
 supervisor_1               supervisor_2
 restart temporary          restart temporary
    |                         |
  gen_server_1              gen_server_2
  restart transient         restart transient
