0 votes

In Spark we can load data directly from HDFS, and the number of partitions of the resulting RDD will equal the number of blocks of the file. HDFS is known for keeping replicated copies of each block, so my question is: how does Spark deal with this replication, and how is RDD partitioning governed?
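For example (a minimal sketch; the HDFS path is hypothetical), the partition count of an RDD loaded from HDFS can be inspected directly:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-count"))

    // Hypothetical HDFS path; any file on the cluster works.
    val rdd = sc.textFile("hdfs:///data/events.log")

    // One partition per InputSplit (by default, one per HDFS block),
    // regardless of how many replicas each block has.
    println(s"partitions = ${rdd.partitions.length}")

    sc.stop()
  }
}
```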

Correct me if I have framed the question wrong.


2 Answers

1 vote

You want to bring the computation to the data, so depending on where the task can be performed (i.e., which physical nodes hold the persistent data), Spark will either use the closest available replica (same node, same rack, etc.) or schedule the task based on where the data is available. This part is handled by the YARN scheduler.
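To make this concrete, here is a minimal sketch (the HDFS path is hypothetical) that prints the replica hosts Spark considers as preferred locations for each partition; this is the same locality information the scheduler consults when placing tasks:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocations {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preferred-locations"))

    // Hypothetical HDFS path.
    val rdd = sc.textFile("hdfs:///data/events.log")

    // For each partition, Spark reports the hosts holding a replica of the
    // underlying block; the scheduler tries to run the task on one of them.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```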

0 votes

As you can see in the Spark user guide, there are some configuration properties regarding data locality that you can set (extracted from the Spark 1.6 user guide, http://spark.apache.org/docs/latest/configuration.html); a sketch of setting them follows the list:

spark.locality.wait (default: 3s)
How long to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait will be used to step through multiple locality levels (process-local, node-local, rack-local and then any). It is also possible to customize the waiting time for each level by setting spark.locality.wait.node, etc. You should increase this setting if your tasks are long and see poor locality, but the default usually works well.

spark.locality.wait.node (default: spark.locality.wait)
Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).

spark.locality.wait.process (default: spark.locality.wait)
Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.

spark.locality.wait.rack (default: spark.locality.wait)
Customize the locality wait for rack locality.
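As an illustration only (the values below are arbitrary, not recommendations), these properties can be set programmatically on the SparkConf:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalityTuning {
  def main(args: Array[String]): Unit = {
    // Wait longer for a node-local slot, but give up on process locality
    // immediately; the chosen durations are purely illustrative.
    val conf = new SparkConf()
      .setAppName("locality-tuning")
      .set("spark.locality.wait", "10s")
      .set("spark.locality.wait.process", "0s")
      .set("spark.locality.wait.node", "10s")
      .set("spark.locality.wait.rack", "5s")

    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
```

The same settings can also be passed at submission time, e.g. with --conf spark.locality.wait=10s on spark-submit, which avoids hard-coding them in the application.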