
I have been trying to understand the Storm architecture, but I am not sure if I got it right. I will try to explain as exactly as possible what I believe to be the case. Please explain what, if anything, I got wrong and what is right.

Preliminary thoughts: workers

http://storm.apache.org/releases/2.0.0-SNAPSHOT/Understanding-the-parallelism-of-a-Storm-topology.html suggests that a worker is a process, as does http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html with "worker processes. Each worker process is a physical JVM". However, http://storm.apache.org/releases/1.0.1/Setting-up-a-Storm-cluster.html states that a worker is a node, with "Nimbus and worker machines", and http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/ mentions a "master node" and "worker nodes". So: is a worker a process or a physical node (or is a node a process)? My conclusion is that there are two distinct things: Worker Nodes and Worker Processes.

What I believe to be true

Entities in play

  • Master Node = Management Server
  • Worker Node = Slave Node
  • Nimbus JVM process, running on Master Node
  • ZooKeeper JVM processes, running on ZooKeeper Nodes
  • Supervisor JVM process, running on Worker Nodes
  • Worker Process (JVM), running on Worker Nodes
  • Executor thread, run by a Worker Process
  • Task (an instance of a Bolt or Spout), executed by an Executor

How things work

The Nimbus is a JVM process, running on the physical Master Node, that receives my program (a Storm topology), takes the Bolts and Spouts, and generates Tasks from them. If a Bolt is supposed to be parallelized three times, the Nimbus generates three Tasks for it. The Nimbus asks the ZooKeeper JVM processes about the configuration of the cluster, e.g. where to run those Tasks, and the ZooKeeper JVM processes tell the Nimbus; in order to do this, ZooKeeper communicates with the Supervisors (what they are comes later).

The Nimbus then distributes the Tasks to the Worker Nodes, which are physical nodes. Worker Nodes are managed by Supervisors, which are JVM processes, exactly one Supervisor per Worker Node. Supervisors manage (start, stop, etc.) the Worker Processes, which are JVM processes running on the Worker Nodes; each Worker Node can have multiple Worker Processes running. Each Worker Process runs one or multiple threads called Executors, and each Executor thread runs one or multiple Tasks, meaning one or multiple instances of a Bolt or Spout, but those have to be instances of the same Bolt/Spout.
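To make this concrete, here is a minimal sketch of how such a topology might be submitted. MySpout and MyBolt are hypothetical spout/bolt implementations and "example-topology" is an invented name, not something from the documentation:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class SubmitExample {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // One executor for the spout, three for the bolt -- the
            // "parallelized three times" case described above.
            builder.setSpout("source", new MySpout(), 1);   // hypothetical spout
            builder.setBolt("work", new MyBolt(), 3)        // hypothetical bolt
                   .shuffleGrouping("source");

            Config conf = new Config();
            // Ask for two Worker Processes (JVMs); the Supervisors on the
            // Worker Nodes start them, and Nimbus assigns the Tasks.
            conf.setNumWorkers(2);

            StormSubmitter.submitTopology("example-topology", conf,
                                          builder.createTopology());
        }
    }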

If this is all true, it raises the following questions:

  • What is the point of having multiple Worker Processes run on one Worker Node? After all, a single process can already use multiple processor cores, right?
  • What is the point of having multiple Tasks run by one Executor thread if they have to be of the same Bolt/Spout? A thread only runs on one processor core at a time, so the multiple Bolt/Spout instances have to run one after the other and cannot be parallelized. Since they run indefinitely in the Storm topology, what would be the point of having two instances of the same Bolt/Spout in one Executor thread?

Edit: Good additional resource: http://www.tutorialspoint.com/apache_storm/apache_storm_cluster_architecture.htm


1 Answer


First, a few clarifications (basically, you got it right).

  • The terms "node" and "process" are unfortunately not always used consistently. But you got the "Entities in play" correct.
  • About parallelization: the number of tasks is determined by .setNumTasks() -- if it is not specified, it defaults to parallelism_hint (which sets the number of executors). See the sketch after this list.
  • About deployment: Nimbus gets the cluster configuration from ZK, but ZK does not decide where to run a task -- Nimbus has a scheduler component that makes this decision (based on the given topology, the topology configuration, and the cluster configuration).
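As an illustration of the parallelization point, here is a sketch adapted from the linked parallelism page (green-bolt, blue-spout, and GreenBolt are that page's example names):

    TopologyBuilder builder = new TopologyBuilder();

    // parallelism_hint = 2 -> two executors for this bolt.
    // setNumTasks(4)      -> four tasks in total, i.e. two tasks per executor.
    // Without the setNumTasks() call, the bolt would get one task per
    // executor, i.e. two tasks.
    builder.setBolt("green-bolt", new GreenBolt(), 2)
           .setNumTasks(4)
           .shuffleGrouping("blue-spout");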

To answer your question:

  • A single worker process only executes tasks from a single topology. Thus, if you want to run multiple topologies, you need multiple worker JVMs. The reason for this design is fault tolerance, i.e., isolation of topologies: if one topology fails (maybe due to bad user code), the crashing JVM does not affect other running topologies.
  • Tasks allow for dynamic rebalancing, i.e., changing the parallelism at runtime. Without tasks, you would need to stop and redeploy a topology to change the parallelism of a spout/bolt. Thus, the number of tasks defines the maximum parallelism of a spout/bolt; see the rebalance example below and https://storm.apache.org/releases/1.0.1/Understanding-the-parallelism-of-a-Storm-topology.html
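For example, rebalancing can be done with the storm CLI; this command is taken from the linked parallelism page (mytopology, blue-spout, and yellow-bolt are that page's example names):

    # Reconfigure the running topology "mytopology" to use 5 worker
    # processes, 3 executors for blue-spout, and 10 executors for
    # yellow-bolt. The new parallelism of a component cannot exceed
    # its configured number of tasks.
    storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10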