3
votes

I'm trying to understand what's happening here:

I have a supervisor that is cyclically restarting one client without triggering the MaxR, MaxT mechanism. The client crashes slowly enough that it never triggers the restart rate limit.

There is another mechanism that uses supervisor:which_children/1, supervisor:delete_child/2 and supervisor:start_child/2 to adapt the set of children to reality (it scans for USB devices and tries to keep one supervisor child per device found).
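A minimal sketch of what such a reconciliation step could look like. The module name `child_sync`, the function `sync/2`, and the use of tuple-form child specs are assumptions for illustration, not the actual code from this question:

```erlang
-module(child_sync).
-export([sync/2]).

%% Sup is the supervisor's registered name or pid.
%% WantedSpecs is a list of tuple-form child specs
%% {Id, MFA, Restart, Shutdown, Type, Modules}, one per device found.
%% (Assumption: with map-form specs the id would be read with maps:get/2.)
sync(Sup, WantedSpecs) ->
    Running = [Id || {Id, _Pid, _Type, _Mods} <- supervisor:which_children(Sup)],
    WantedIds = [element(1, Spec) || Spec <- WantedSpecs],
    %% Remove children whose device has disappeared. delete_child/2
    %% only works on a stopped child, so terminate it first.
    lists:foreach(
      fun(Id) ->
              _ = supervisor:terminate_child(Sup, Id),
              _ = supervisor:delete_child(Sup, Id)
      end,
      Running -- WantedIds),
    %% Start a child for every device that has none yet.
    lists:foreach(
      fun(Spec) ->
              case lists:member(element(1, Spec), Running) of
                  true  -> ok;
                  false -> _ = supervisor:start_child(Sup, Spec)
              end
      end,
      WantedSpecs),
    ok.
```

Every function used here is a synchronous call into the supervisor process, so none of them can complete while the supervisor is not processing its message queue.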

This would normally act as a safety net for the rate limitation, but strangely the mechanism that deletes and starts children does not seem to be called at all.

To find out what's going on, I called supervisor:which_children/1 from the shell, and the call just blocks and never returns.

Can it be that calls to the supervisor are blocked while it is busy trying to restart a child?

Addendum:

It looks like the crash happens during child start:

=SUPERVISOR REPORT==== 29-Mar-2011::21:36:20 ===
     Supervisor: {local,gateway_sup}
     Context:    start_error
     Reason:     {'EXIT',{timeout,{gen_server,call,[<0.155.0>,late_init]}}}
     Offender:   [{pid,<0.76.0>},
              {name,gw_3_5},
              {mfa,{channel,start_link,
                            [[{gateways,[{left,108},{right,103}]}],
                             {3,5}]}},
              {restart_type,transient},
              {shutdown,10000},
              {child_type,worker}]
Are you making a gen_server:call in the start_link function of the child? - Adam Lindberg
Yes, I do. I need some late initialization that has to be done after the gen_server is already running. - Peer Stritzinger
Why don't you do this in the init function instead? It seems there may be a risk of deadlock here... - Adam Lindberg
@Adam: the stuff in late_init needs the gen_server to be running already (it needs the pid of the gen_server). I don't see any deadlock possibility here (and the reason for the timeout is known). You can see the code here: ideone.com/KtM6N - Peer Stritzinger
Your problem aside, you should be able to just run self() in the gen_server process to get its own pid. - Adam Lindberg

1 Answer

2
votes

Leaving the discussion in the comments aside, the answer to the question is:

When restarting a child that fails during startup, the supervisor loops inside its own process (internally it is a gen_server) and does not handle any API calls made to it in the meantime.

So it is especially bad if the rate limitation of the supervisor is configured such that it will not trigger on startup errors of the children. In my example, startup is slow (especially on error).

So if the supervisor loops forever trying to restart a child, it is unreachable for any calls to it, which is usually bad.
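One way to keep the slow part out of the supervisor's start path is to let init/1 return immediately and run the late initialization as the first message the server handles. The following is only a sketch under assumptions: the module name `channel_sketch`, the `#state{}` record, and `do_late_init/1` are hypothetical placeholders, not the code linked in the comments.

```erlang
-module(channel_sketch).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Hypothetical state; the real server will hold something else.
-record(state, {opts, id, ready = false}).

start_link(Opts, Id) ->
    %% No gen_server:call here, so start_link returns as soon as
    %% init/1 has returned.
    gen_server:start_link(?MODULE, {Opts, Id}, []).

init({Opts, Id}) ->
    %% self() is already the gen_server's pid inside init/1.
    %% Queue the late initialization as a message to ourselves.
    self() ! late_init,
    {ok, #state{opts = Opts, id = Id}}.

handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(late_init, State) ->
    %% Runs after the supervisor has been told the child started.
    ok = do_late_init(State),
    {noreply, State#state{ready = true}};
handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

%% Placeholder for the slow or failure-prone initialization.
do_late_init(#state{}) ->
    ok.
```

With this layout, a failure in the late initialization crashes an already started child instead of making the start itself fail, so the supervisor does not sit in a slow start attempt while it happens. The trade-off is that callers can reach the server before `ready` is set, so requests arriving early need to be handled or rejected explicitly.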