I have been trying to understand the Storm architecture, but I am not sure if I have got it right. I will try to explain as exactly as possible what I believe to be the case. Please tell me what, if anything, I got wrong and what is right.
Preliminary thoughts: workers
http://storm.apache.org/releases/2.0.0-SNAPSHOT/Understanding-the-parallelism-of-a-Storm-topology.html suggests that a worker is a process, as does http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html with "worker processes. Each worker process is a physical JVM". However, http://storm.apache.org/releases/1.0.1/Setting-up-a-Storm-cluster.html states that a worker is a node, with "Nimbus and worker machines", and http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/ mentions a "master node" and "worker nodes". So which is it: is a worker a process or a physical node (or is a node a process)? My conclusion is that there are two distinct things: Worker Nodes and Worker Processes.
What I believe to be true
Entities in play
- Master Node = Management Server
- Worker Node = Slave Node
- Nimbus JVM process, running on Master Node
- ZooKeeper JVM processes, running on ZooKeeper Nodes
- Supervisor JVM process, running on Worker Nodes
- Worker Process (JVM), running on Worker Nodes
- Executor thread, run by a Worker Process
- Task (an instance of a Bolt or Spout), executed by an Executor thread
How things work
The Nimbus is a JVM process, running on the physical Master Node, that receives my program (the Storm topology), takes its Bolts and Spouts, and generates Tasks from them. If a Bolt is supposed to be parallelized three times, the Nimbus generates three Tasks for it. The Nimbus asks the ZooKeeper JVM processes about the configuration of the cluster, e.g. where to run those Tasks, and the ZooKeeper processes tell the Nimbus; in order to do this, ZooKeeper communicates with the Supervisors (what those are comes later).
The Nimbus then distributes the Tasks to the Worker Nodes, which are physical nodes. Worker Nodes are managed by Supervisors, which are JVM processes, with exactly one Supervisor per Worker Node. Supervisors manage (start, stop, etc.) the Worker Processes, which are JVM processes that run on the Worker Nodes; each Worker Node can have multiple Worker Processes running. Each Worker Process runs one or more threads called Executors, and each Executor thread runs one or more Tasks, meaning one or more instances of a Bolt or Spout, but those have to be instances of the same Bolt or Spout.
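If my picture is right, the hierarchy above should map onto the parallelism settings of a topology roughly as in the following minimal sketch. This is my own illustration, not taken from the docs; MySpout and MyBolt are hypothetical placeholder components I made up for the example:

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class ParallelismExample {

        // Minimal placeholder Spout: endlessly emits a constant word.
        public static class MySpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            @Override public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            @Override public void nextTuple() { collector.emit(new Values("hello")); }
            @Override public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("word")); }
        }

        // Minimal placeholder Bolt: acks every tuple and emits nothing.
        public static class MyBolt extends BaseRichBolt {
            private OutputCollector collector;
            @Override public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
                this.collector = collector;
            }
            @Override public void execute(Tuple t) { collector.ack(t); }
            @Override public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new MySpout(), 2);   // parallelism hint: 2 Executor threads
            builder.setBolt("bolt", new MyBolt(), 3)       // 3 Executor threads for the Bolt ...
                   .setNumTasks(6)                         // ... running 6 Tasks, i.e. 2 Tasks per Executor
                   .shuffleGrouping("spout");

            Config conf = new Config();
            conf.setNumWorkers(2);                         // 2 Worker Processes (JVMs) for this topology

            StormSubmitter.submitTopology("parallelism-example", conf, builder.createTopology());
        }
    }

As I understand it, the parallelism hint sets the number of Executor threads, setNumTasks sets the number of Bolt/Spout instances, and setNumWorkers sets the number of Worker Processes those Executors are spread across.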
If this is all true, it raises these questions:
- What is the point of having multiple Worker Processes run on one Worker Node? After all, a single process can use multiple processor cores, right?
- What is the point of having one Executor thread run multiple Tasks if those have to be instances of the same Bolt/Spout? A thread runs on only one processor core, so the Bolt/Spout instances have to run one after the other and cannot be parallelized. Since they run indefinitely in the Storm topology, what would be the point of two instances of the same Bolt/Spout sharing one Executor thread? (See the sketch after this list for the configuration I mean.)
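To make the second question concrete: under my reading, a declaration like this hypothetical variant of the sketch above would yield exactly that situation:

    // Parallelism hint 1 = one Executor thread; setNumTasks(2) = two instances
    // of the same Bolt, so both Tasks would share that single thread.
    builder.setBolt("bolt", new MyBolt(), 1)
           .setNumTasks(2)
           .shuffleGrouping("spout");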
Edit: Good additional resource: http://www.tutorialspoint.com/apache_storm/apache_storm_cluster_architecture.htm