I didn't try it, but I suggest that the top supervisor launch another supervisor (one per process), using the simple_one_for_one restart strategy and a transient restart type in the child spec.
This intermediate supervisor then launches the process itself, using the one_for_one restart strategy and a permanent restart type in the child spec, with the maximum number of restarts and the time window (MaxR and MaxT) set to fit your needs.
There is something strange in your question: you say that the supervisor kills all the children that were started when it reaches the maximum number of restarts for one faulty child. I thought that the simple_one_for_one strategy let the workers die by themselves.
[edit]
As I was curious to test this idea, I wrote a small set of modules to try it out.
Here is the code of the top supervisor:
-module(factory).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1, start_process/1]).

%% Each child is a proc_sup supervisor; temporary means factory never restarts it.
-define(CHILD(I, Arglist), {I, {I, start_link, [Arglist]}, temporary, 5000, supervisor, [I]}).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    {ok, { {simple_one_for_one, 0, 10}, [?CHILD(proc_sup, [])]} }.

%% Arglist is [Module, Function, Args] for the worker to supervise.
start_process(Arglist) ->
    supervisor:start_child(?MODULE, [Arglist]).
Then the code of the intermediate supervisor, which is in charge of restarting a process a few times in case of problems:
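-module(proc_sup).
-behaviour(supervisor).

-export([start_link/2]).
-export([init/1]).

%% The worker is permanent: it is restarted whatever its exit reason.
-define(CHILD(Mod, Start, Arglist), {Mod, {Mod, Start, Arglist}, permanent, 5000, worker, [Mod]}).

%% The first argument is the (unused) [] coming from the factory child spec.
start_link(_, Arglist) ->
    io:format("proc_sup arg = ~p~n", [Arglist]),
    supervisor:start_link(?MODULE, [Arglist]).

%% At most 5 restarts within 10 seconds before this supervisor gives up.
init([[Mod, Start | [Arglist]]]) ->
    {ok, { {one_for_one, 5, 10}, [?CHILD(Mod, Start, Arglist)]} }.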
And then the code of a small module that can be stopped, can receive messages, or can be programmed to die after a certain time, in order to test the mechanism:
-module(dumb).
-export([start_link/1, loop/2]).

start_link(Arg) ->
    io:format("dumb start param = ~p~n", [Arg]),
    {ok, spawn_link(?MODULE, loop, [Arg, init])}.

%% {die, T}: exit normally after T milliseconds.
loop({die, T}, _) ->
    receive
    after T -> ok
    end;
loop(Arg, init) ->
    io:format("loop pid ~p with arg ~p~n", [self(), Arg]),
    loop(Arg, 0);
%% stop ends the process; any other message increments the cycle counter.
loop(Arg, N) ->
    io:format("loop ~p (~p) cycle ~p~n", [Arg, self(), N]),
    receive
        stop -> 'restart_:o)';
        _ -> loop(Arg, N + 1)
    end.
Finally, a copy of the shell session:
1> factory:start_link().
{ok,<0.37.0>}
2>
2> factory:start_process([dumb,start_link,[loop_1]]).
proc_sup arg = [dumb,start_link,[loop_1]]
dumb start param = loop_1
loop pid <0.40.0> with arg loop_1
loop loop_1 (<0.40.0>) cycle 0
{ok,<0.39.0>}
3>
3> factory:start_process([dumb,start_link,[loop_1]]).
proc_sup arg = [dumb,start_link,[loop_1]]
dumb start param = loop_1
loop pid <0.43.0> with arg loop_1
loop loop_1 (<0.43.0>) cycle 0
{ok,<0.42.0>}
4>
4> factory:start_process([dumb,start_link,[loop_2]]).
proc_sup arg = [dumb,start_link,[loop_2]]
dumb start param = loop_2
loop pid <0.46.0> with arg loop_2
loop loop_2 (<0.46.0>) cycle 0
{ok,<0.45.0>}
5>
5> pid(0, 2310, 0) ! hello.
hello
6>
6> pid(0, 40, 0) ! hello.
loop loop_1 (<0.40.0>) cycle 1
hello
7> pid(0, 40, 0) ! hello.
loop loop_1 (<0.40.0>) cycle 2
hello
8> pid(0, 40, 0) ! hello.
loop loop_1 (<0.40.0>) cycle 3
hello
9> pid(0, 43, 0) ! hello.
loop loop_1 (<0.43.0>) cycle 1
hello
10> pid(0, 43, 0) ! hello.
loop loop_1 (<0.43.0>) cycle 2
hello
11> pid(0, 40, 0) ! stop.
dumb start param = loop_1
stop
loop pid <0.54.0> with arg loop_1
loop loop_1 (<0.54.0>) cycle 0
12> pid(0, 40, 0) ! stop.
stop
13> pid(0, 54, 0) ! stop.
dumb start param = loop_1
stop
loop pid <0.57.0> with arg loop_1
loop loop_1 (<0.57.0>) cycle 0
14> pid(0, 57, 0) ! hello.
loop loop_1 (<0.57.0>) cycle 1
hello
15> factory:start_process([dumb,start_link,[{die,5}]]).
proc_sup arg = [dumb,start_link,[{die,5}]]
dumb start param = {die,5}
{ok,<0.60.0>}
16>
dumb start param = {die,5}
dumb start param = {die,5}
dumb start param = {die,5}
dumb start param = {die,5}
dumb start param = {die,5}
16> factory:start_process([dumb,start_link,[{die,50000}]]).
proc_sup arg = [dumb,start_link,[{die,50000}]]
dumb start param = {die,50000}
{ok,<0.68.0>}
17>
dumb start param = {die,50000}
17>
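The session shows the restarts but does not directly check that the other children survive once a faulty worker has used up its restart budget. A minimal sketch of such a check follows, assuming a compact single-module variant of the same two-level layout. The module name isolation_check and the functions run/0 and start_worker/1 are hypothetical and not part of the modules above; only standard supervisor calls are used.

```erlang
%% Minimal sketch (hypothetical module, not part of the code above).
-module(isolation_check).
-behaviour(supervisor).

-export([run/0, init/1, start_worker/1]).

%% Hypothetical worker: exits normally after T milliseconds
%% (T may be the atom infinity, in which case it never exits).
start_worker(T) ->
    {ok, spawn_link(fun() -> receive after T -> ok end end)}.

%% One callback module serves both supervisor levels.
init(top) ->
    %% Same role as factory: temporary children are never restarted.
    Sub = {sub, {supervisor, start_link, [?MODULE]},
           temporary, infinity, supervisor, [?MODULE]},
    {ok, {{simple_one_for_one, 0, 10}, [Sub]}};
init({sub, T}) ->
    %% Same role as proc_sup: give up after 5 restarts in 10 seconds.
    Worker = {worker, {?MODULE, start_worker, [T]},
              permanent, 5000, worker, [?MODULE]},
    {ok, {{one_for_one, 5, 10}, [Worker]}}.

run() ->
    {ok, Top} = supervisor:start_link(?MODULE, top),
    %% One intermediate supervisor with a worker that never dies ...
    {ok, Healthy} = supervisor:start_child(Top, [{sub, infinity}]),
    %% ... and one whose worker dies every 5 ms.
    {ok, Faulty} = supervisor:start_child(Top, [{sub, 5}]),
    %% Leave time for the faulty worker to exhaust its restart budget.
    timer:sleep(500),
    {is_process_alive(Healthy),
     is_process_alive(Faulty),
     length(supervisor:which_children(Top))}.
```

Calling isolation_check:run() returns a tuple {HealthyAlive, FaultyAlive, RemainingChildren}. If the isolation works as intended, only the intermediate supervisor of the faulty worker should be gone, while the top supervisor and the healthy branch keep running.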
Children of a simple_one_for_one supervisor all have the same specification. Then I do not understand how "that particular" child that just died is different from the rest. How would your supervisor know that it need not restart a child when it dies? – akonsu