0 votes

I am learning about Apache Spark and HDFS. I understand both of them for the most part, although I am confused about one thing. My question is: are the data nodes in HDFS the same as the executor nodes in a Spark cluster? In other words, do the nodes in HDFS operate on the data that they contain, or is the data from the DataNodes in HDFS sent to executor nodes in a Spark cluster where the data is operated on? Please let me know if you would like me to clarify anything! Any help would be much appreciated!

Thank you,

Taylor


3 Answers

2 votes

I always think about these concepts from a standalone perspective first, and then from a cluster perspective.

Considering a single machine (where you would run Spark in local mode), DataNode and NameNode are just pieces of software that support HDFS's abstract design: the NameNode stores the file tree and file metadata, while the DataNode stores the actual data chunks. Driver and executors are Spark concepts; in local mode, a Spark application consists of a driver process and a set of executor processes, which run as threads on your individual computer.
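To make the local-mode picture concrete, here is a minimal sketch (the app name and file path are made up for illustration) of a Spark application run with a local master. The driver and all executor threads live in a single JVM, and no DataNode or NameNode is involved at all:

```scala
import org.apache.spark.sql.SparkSession

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs the driver and executor threads inside this one JVM,
    // using as many worker threads as there are cores on this machine.
    val spark = SparkSession.builder()
      .appName("local-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read a plain local file; in local mode the "cluster" is just threads
    // on your own computer, so HDFS is not required at all.
    val lines = spark.read.textFile("/tmp/sample.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}
```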

2 votes

Only if the DataNode is also running a NodeManager. HDFS only handles data; YARN handles compute. YARN's ResourceManager assigns compute resources to NodeManagers, which for obvious reasons are co-located with DataNodes.

YARN and Spark attempt to place executors on the DataNodes/NodeManagers that hold the data Spark is processing (data locality), but this is an optimization, not a hard requirement. This matters less than it used to: most modern data centers have 10 Gb Ethernet backplanes, so moving data to a spare node is far cheaper than it was when sending data across the network was expensive.
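As a rough illustration of that point, the sketch below (the HDFS path is hypothetical) reads a file from HDFS and sets `spark.locality.wait`, the configuration that controls how long the scheduler waits for a data-local slot before falling back to a less-local one and shipping the block over the network:

```scala
import org.apache.spark.sql.SparkSession

object LocalitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locality-sketch")
      // How long the task scheduler waits for a data-local executor slot
      // before degrading locality (NODE_LOCAL -> RACK_LOCAL -> ANY).
      // Locality is a preference, not a requirement: if no local slot
      // frees up in time, the data is simply read over the network.
      .config("spark.locality.wait", "3s")
      .getOrCreate()

    // When reading from HDFS, Spark learns from the NameNode which nodes
    // hold each block and prefers to schedule the matching task there.
    val events = spark.read.textFile("hdfs:///data/events/part-00000.txt")
    println(events.count())

    spark.stop()
  }
}
```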

0 votes

If your Spark cluster is running with yarn as its master, then yes, your Spark executors will run on the same nodes in the Hadoop cluster that store the data.

In fact, moving the computation to the data, rather than the data to the computation, is a key way of improving performance in a distributed computation, since moving a serialised task to a node is a lot cheaper than moving gigabytes of data to the task.
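Here is a minimal sketch of that setup, assuming `HADOOP_CONF_DIR` points at your cluster configuration and using a made-up HDFS path. With yarn as the master, the ResourceManager launches the executors inside NodeManager containers, which usually sit on the same machines as the DataNodes holding the blocks being read:

```scala
import org.apache.spark.sql.SparkSession

object YarnSketch {
  def main(args: Array[String]): Unit = {
    // Requires HADOOP_CONF_DIR (or YARN_CONF_DIR) to point at the cluster's
    // Hadoop configuration so Spark can find the ResourceManager and NameNode.
    val spark = SparkSession.builder()
      .appName("yarn-sketch")
      .master("yarn")
      .getOrCreate()

    // Each task is scheduled, where possible, on an executor running on the
    // node that stores the HDFS block it processes.
    val logs = spark.read.textFile("hdfs:///logs/2024/01/*.log")
    println(logs.count())

    spark.stop()
  }
}
```

In practice the master is usually supplied via `spark-submit --master yarn` rather than hard-coded in the application, but the effect is the same.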