Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

Question

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?
If I use the Spark standalone cluster manager and my data is distributed across an HDFS cluster, how will Spark know which data is stored locally on each node?

OneCricketeer · Accepted Answer · 2016-10-18T07:06:34

YARN is a resource manager. It deals with memory and processes, not with the workings of HDFS or data locality.

Spark can read from HDFS sources directly, and the NameNode and DataNodes handle all of the HDFS block management outside of YARN, so I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of running Spark on YARN?
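As for how Spark would know: as far as I understand it, the Hadoop input format reports which hosts hold each block (the NameNode tracks this), and the scheduler compares those hosts against the hosts its executors run on, whatever the cluster manager is. Here is a toy sketch of that comparison in plain Python. It is not Spark's code: the function name, block names, and host names are all made up for illustration, and it ignores the other locality levels (PROCESS_LOCAL, RACK_LOCAL, NO_PREF).

```python
from typing import Dict, List


def locality_level(block_hosts: List[str], executor_hosts: List[str]) -> str:
    """Hypothetical helper: classify one block read as NODE_LOCAL or ANY.

    A read can be NODE_LOCAL only if at least one executor runs on a host
    that stores a replica of the block.
    """
    if set(block_hosts) & set(executor_hosts):
        return "NODE_LOCAL"
    return "ANY"


# Made-up replica locations, in the shape the NameNode might report them.
block_locations: Dict[str, List[str]] = {
    "part-00000": ["node1", "node2", "node3"],
    "part-00001": ["node4", "node5", "node6"],
}

# Made-up hosts where standalone workers have launched executors.
executor_hosts = ["node1", "node2"]

levels = {
    block: locality_level(hosts, executor_hosts)
    for block, hosts in block_locations.items()
}

for block, level in levels.items():
    print(block, level)
```

The practical takeaway under these assumptions: in standalone mode the workers need to run on the same machines as the DataNodes, and both need to be known by the same host names, otherwise nothing will ever match.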
