Fault Tolerance in a Distributed Erlang

Question

How can I achieve fault tolerance in a distributed application? As far as I understand, a supervision tree only supervises local processes (is that right?). How can I supervise processes that are spawned on remote nodes? I need to supervise them and restart them if they fail.

Hynek -Pichi- Vychodil · Accepted Answer · 2014-05-17T10:20:43

Look at the OTP Design Principles, especially chapter 9, Distributed Applications, and its subchapters 9.4 Failover and 9.5 Takeover.
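To make the failover idea concrete, here is a rough sketch of the kernel configuration that a distributed application relies on. The application name `myapp` and the node names `cp1@cave`, `cp2@cave` and `cp3@cave` are placeholders I made up for illustration; the `distributed`, `sync_nodes_mandatory` and `sync_nodes_timeout` parameters are the kernel settings that chapter describes.

```erlang
%% cp1.config: a sketch for the node cp1@cave (all names are hypothetical).
[{kernel,
  [%% myapp runs on cp1@cave; if that node goes down, it is restarted
   %% after 5000 ms on cp2@cave or cp3@cave (the tuple means no preference
   %% between those two).
   {distributed, [{myapp, 5000, [cp1@cave, {cp2@cave, cp3@cave}]}]},
   %% The other nodes that must be up before applications start here.
   {sync_nodes_mandatory, [cp2@cave, cp3@cave]},
   %% How long (ms) to wait for those nodes at boot.
   {sync_nodes_timeout, 5000}
  ]}
].
```

Each node would get its own file with the same `distributed` entry and with `sync_nodes_mandatory` listing the other nodes, and would be started with something like `erl -sname cp1 -config cp1`.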

If you are interested in the topic in general, you should look at the famous thesis Making reliable distributed systems in the presence of software errors, and also at the many published books on the topic. Some of the material is also available online, see 3 Free E-Books and a Tutorial on Erlang. For example, there is a chapter about distribution, Distribunomicon.

TL;DR? Long story short, just as you wrote, the nodes have to monitor each other's supervision trees and restart them in case of failure. You can even reinvent the wheel, because Erlang itself provides great tools for doing it, or use an existing solution, from bare OTP to riak_core.
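If you do want to reinvent the wheel, the basic building block is a monitor on the remote process. The following is only a minimal sketch, not a replacement for a supervisor: the module name `remote_watch`, the fixed one-second delay and the "restart on any exit" policy are my own assumptions, and there is no restart intensity limit. The watcher process itself should sit under a normal local supervisor.

```erlang
-module(remote_watch).
-export([start/4]).

%% Start a local watcher that runs M:F(A) on Node and restarts it
%% whenever it goes down. Returns the pid of the watcher.
start(Node, M, F, A) ->
    spawn(fun() -> watch(Node, M, F, A) end).

watch(Node, M, F, A) ->
    Pid = spawn(Node, M, F, A),
    %% Monitors work across nodes; if Node is unreachable or the process
    %% is already gone, a 'DOWN' message arrives immediately
    %% (reason noconnection or noproc).
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            %% Assumed policy: always restart, after a fixed delay.
            timer:sleep(1000),
            watch(Node, M, F, A)
    end.
```

For example, `remote_watch:start('worker@host', my_worker, run, [])` would keep `my_worker:run()` alive on `worker@host`, where both the node and the worker module are hypothetical names.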
