1 vote

I want to understand the following terms:

hadoop (single-node and multi-node)
spark master
spark worker
namenode
datanode

What I have understood so far is that the Spark master is the job executor and manages all the Spark workers, whereas Hadoop is the HDFS (where our data resides) from which the Spark workers read data according to the job given to them. Please correct me if I am wrong.

I also want to understand the roles of the NameNode and DataNode. I know the role of the NameNode (it holds the metadata about all the DataNodes, and preferably there is only one, though there could be two), and that there can be multiple DataNodes, which hold the data.

Are DataNodes the same as Hadoop nodes?


2 Answers

5 votes

SPARK Architecture:

Spark uses a master/worker architecture. There is a driver that talks to a single coordinator, called the master, which manages the workers on which executors run.

The driver and the executors run in their own Java processes. You can run them all on the same machine (horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.

Nodes are nothing but the physical machines.
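To make the master/worker split concrete, here is a minimal PySpark sketch (not part of the original answer). The master URL, executor memory, and application name are assumptions for illustration: the driver builds the session and plans the work, the master schedules executors on the workers, and the executors run the computation.

```python
# Minimal sketch: a driver connecting to a hypothetical standalone Spark master.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("spark://master-host:7077")     # hypothetical master URL (assumption)
    .config("spark.executor.memory", "1g")  # memory for each executor JVM process
    .getOrCreate()
)

# Planned on the driver, executed in parallel by the executors on the workers.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```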

Hadoop NameNode and DataNode:

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.


Yes, DataNodes are the slave nodes in a Hadoop cluster.
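As a rough illustration (the NameNode address and file path below are assumptions, not from the question), a Spark client reads an HDFS file by asking the NameNode for the block locations and then fetching the blocks from the DataNodes:

```python
# Minimal sketch: reading a file from HDFS through a hypothetical NameNode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()

# The hdfs:// URI points at the NameNode (metadata only); the actual bytes
# are served by the DataNodes that store the file's blocks.
df = spark.read.text("hdfs://namenode-host:9000/data/events.log")
print(df.count())

spark.stop()
```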

Please refer to the documentation for more details.

0 votes

Hadoop single-node: a Hadoop cluster with 1 NameNode (master) and 1 DataNode (slave). The NameNode holds all the metadata and assigns work to the slave DataNodes, where the data is stored and the processing is done.

Hadoop multi-node: a Hadoop cluster with 1 NameNode (master) and n DataNodes (slaves).

Spark master: plays the same role as the NameNode in HDFS.

Spark worker: plays the same role as a DataNode, but a Spark worker is only meant for processing, not for storing data.

To put things in (simple) context: if there is a cluster with 1 NameNode and 2 DataNodes (1 GB each), a 2 GB file will be split and stored across the DataNodes. Similarly, a Spark job will be split to process this data on the individual DataNodes (workers) in parallel, as in the sketch below.
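A small sketch of that scenario, with assumed host names and file paths: Spark turns the file's HDFS blocks into partitions and runs one task per partition on the workers, so the parts of the file are processed in parallel.

```python
# Minimal sketch: one task per partition, partitions roughly match HDFS blocks.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("split-job-demo")
    .master("spark://master-host:7077")  # hypothetical standalone master
    .getOrCreate()
)

# Hypothetical 2 GB file stored as blocks on the DataNodes.
lines = spark.sparkContext.textFile("hdfs://namenode-host:9000/data/big-2gb-file.txt")
print("partitions (roughly one per HDFS block):", lines.getNumPartitions())

# Each worker counts the lines in its own partitions; the driver sums the results.
print("total lines:", lines.count())

spark.stop()
```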