0 votes

I am learning about Apache Spark and HDFS. I understand both of them for the most part, although I am confused about one thing. My question is: are the data nodes in HDFS the same as the executor nodes in a Spark cluster? In other words, do the nodes in HDFS operate on the data that they contain, or is the data from the DataNodes in HDFS sent to executor nodes in a Spark cluster where the data is operated on? Please let me know if you would like me to clarify anything! Any help would be much appreciated!

Thank you,

Taylor


3 Answers

2 votes

I always think about these concepts from a standalone perspective first, and then from a cluster perspective.

Considering a single machine (where you would run Spark in local mode), DataNode and NameNode are just pieces of software that support HDFS's abstract design: the NameNode stores the file tree and file metadata, while the DataNode stores the actual data chunks. Driver and executors are Spark concepts; in local mode, a Spark application consists of a driver process and a set of executor processes, which run as threads on your individual computer.
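To make the local-mode picture concrete, here is a minimal sketch (the app name and file path are made up for illustration) of a Spark application run with a local master. The driver and all executor threads live in a single JVM, and no DataNode or NameNode is involved at all:

```scala
import org.apache.spark.sql.SparkSession

object LocalModeSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs the driver and executor threads inside this one JVM,
    // using as many worker threads as there are cores on this machine.
    val spark = SparkSession.builder()
      .appName("local-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read a plain local file; in local mode the "cluster" is just threads
    // on your own computer, so HDFS is not required at all.
    val lines = spark.read.textFile("/tmp/sample.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}
```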

2 votes

Only if the DataNode is also running a NodeManager. HDFS only handles data; YARN handles compute. YARN's ResourceManager assigns compute resources to NodeManagers, which for obvious reasons are co-located with DataNodes.

YARN and Spark attempt to place executors on the DataNodes/NodeManagers that hold the data Spark is processing (data locality), but this is an optimization, not a hard requirement. This matters less than it used to: most modern data centers have 10 Gb Ethernet backplanes, so moving data to a spare node is far cheaper than it was when sending data across the network was expensive.
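As a rough illustration of that point, the sketch below (the HDFS path is hypothetical) reads a file from HDFS and sets `spark.locality.wait`, the configuration that controls how long the scheduler waits for a data-local slot before falling back to a less-local one and shipping the block over the network:

```scala
import org.apache.spark.sql.SparkSession

object LocalitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locality-sketch")
      // How long the task scheduler waits for a data-local executor slot
      // before degrading locality (NODE_LOCAL -> RACK_LOCAL -> ANY).
      // Locality is a preference, not a requirement: if no local slot
      // frees up in time, the data is simply read over the network.
      .config("spark.locality.wait", "3s")
      .getOrCreate()

    // When reading from HDFS, Spark learns from the NameNode which nodes
    // hold each block and prefers to schedule the matching task there.
    val events = spark.read.textFile("hdfs:///data/events/part-00000.txt")
    println(events.count())

    spark.stop()
  }
}
```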

0 votes

If your Spark cluster is running with yarn as its master, then yes, your Spark executors will run on the same nodes in the Hadoop cluster that store the data.

In fact, moving the computation to the data, rather than the data to the computation, is a key way of improving performance in a distributed computation, since moving a serialised task to a node is a lot cheaper than moving gigabytes of data to the task.
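Here is a minimal sketch of that setup, assuming `HADOOP_CONF_DIR` points at your cluster configuration and using a made-up HDFS path. With yarn as the master, the ResourceManager launches the executors inside NodeManager containers, which usually sit on the same machines as the DataNodes holding the blocks being read:

```scala
import org.apache.spark.sql.SparkSession

object YarnSketch {
  def main(args: Array[String]): Unit = {
    // Requires HADOOP_CONF_DIR (or YARN_CONF_DIR) to point at the cluster's
    // Hadoop configuration so Spark can find the ResourceManager and NameNode.
    val spark = SparkSession.builder()
      .appName("yarn-sketch")
      .master("yarn")
      .getOrCreate()

    // Each task is scheduled, where possible, on an executor running on the
    // node that stores the HDFS block it processes.
    val logs = spark.read.textFile("hdfs:///logs/2024/01/*.log")
    println(logs.count())

    spark.stop()
  }
}
```

In practice the master is usually supplied via `spark-submit --master yarn` rather than hard-coded in the application, but the effect is the same.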