How does Storm leverage Zookeeper for resilience?

Question

From the description of Storm, it is based on Zookeeper, and whenever a worker node dies, it can be recovered and get its state back from Zookeeper.

Does anyone know how that is done? Specifically:

How does the failed worker node get recovered?
How does Zookeeper keep its state? AFAIK, each znode can only store a small amount of data.

Vishal Vishal · Accepted Answer · 2014-02-02T08:36:30

Are you talking about workers or supervisors? Each Storm worker node runs a Storm "supervisor" daemon, which manages the worker processes on that node.

You need to set up process supervision (something like daemontools or supervisord, which is unrelated to Storm supervisors) to monitor and restart the nimbus and supervisor daemons in case they die from an exception. Both nimbus and the supervisors are fail-fast and stateless. Zookeeper is used for coordination between nimbus and the supervisors, and it also holds state information. All state is kept either in Zookeeper or on disk, so a restarted daemon does not lose it.
The state data isn't large, and Zookeeper should be run supervised too.
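
As a rough illustration of the supervision described above, a supervisord configuration might look something like the sketch below. The install path `/opt/storm` and the `storm` user are assumptions, not anything Storm prescribes. Nimbus normally runs on the master machine and the supervisor daemon on each worker machine, so in practice each host would carry only the entry it needs.

```ini
; Sketch only: the /opt/storm path and the "storm" user are assumed, adjust to your install.

; On the master node
[program:storm-nimbus]
command=/opt/storm/bin/storm nimbus
user=storm
autostart=true
; restart the daemon whenever it exits (it is fail-fast by design)
autorestart=true
startsecs=10

; On each worker node
[program:storm-supervisor]
command=/opt/storm/bin/storm supervisor
user=storm
autostart=true
autorestart=true
startsecs=10
```

Because the daemons keep their state in Zookeeper or on disk, restarting them this way should let them pick up where they left off.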

Check this for more fault tolerance details.