I have an HDFS cluster (say it has 5 data nodes). If I want to set up a Spark cluster (say it has 3 worker nodes) that reads/writes data from/to the HDFS cluster, do I need to make sure the Spark worker nodes run on the same machines as the HDFS data nodes? In my opinion they can be different machines. But if the Spark worker nodes and HDFS data nodes are different machines, then when reading data from HDFS the Spark workers have to fetch the data over the network from other machines, which can lead to higher latency, whereas if they are on the same machines that latency is reduced. Is my understanding correct?
1 Answer
In a bare-metal setup, and as originally postulated by MapReduce, the data locality principle applies as you state: Spark would be installed on all the Data Nodes, meaning every Data Node is also a Worker Node. So a Spark Worker resides on each Data Node to benefit from rack awareness and data locality with HDFS. That said, there are now other storage managers such as Kudu, and other NoSQL variants, that do not use HDFS.
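If you want to see the information that locality-aware scheduling is based on, you can ask the NameNode which DataNodes hold each block of a file. This is only a minimal sketch using the standard Hadoop FileSystem API; the NameNode address `namenode:8020` and the path `/data/events.txt` are placeholders, not anything from your cluster.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocations {
  def main(args: Array[String]): Unit = {
    // Connect to the NameNode; host and path below are illustrative placeholders.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    val status = fs.getFileStatus(new Path("/data/events.txt"))

    // Each block reports the DataNode hosts holding a replica; this is the kind
    // of metadata a locality-aware scheduler uses to prefer node-local tasks.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
      println(s"offset=${block.getOffset} hosts=${block.getHosts.mkString(",")}")
    }
    fs.close()
  }
}
```

If the hosts printed here are also your Spark Workers, tasks reading those blocks can be scheduled node-local; otherwise every read goes over the network.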
With cloud approaches to Hadoop, storage and compute are necessarily separated, e.g. AWS EMR and EC2, et al. It cannot be otherwise if compute is to be elastic. This is not as bad as it sounds, because once the data has been fetched, Spark shuffles records with related keys to the same Workers where possible.
So, in the cloud the question is no longer really relevant. On bare metal, Spark can be installed on machines other than the HDFS nodes, but that would not make much sense. In your case I would install it on all 5 HDFS nodes, not just 3.
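You can also observe the effect of co-location directly. Below is a small spark-shell sketch, assuming a placeholder HDFS path (`hdfs://namenode:8020/data/events.txt`); after the job runs, the task table in the Spark UI shows a "Locality Level" column, which reads NODE_LOCAL when Workers sit on the DataNodes holding the blocks and RACK_LOCAL/ANY when they are separate machines.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-check")
  // How long the scheduler waits for a node-local slot before falling back
  // to rack-local and then any node (the default is 3s).
  .config("spark.locality.wait", "3s")
  .getOrCreate()

// Placeholder path on the HDFS cluster.
val lines = spark.read.textFile("hdfs://namenode:8020/data/events.txt")

// Trigger a job, then inspect the locality levels in the Spark UI.
println(lines.count())
```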