2 votes

When running my Spark jobs in an HDInsight cluster and reading data from the Azure Data Lake Store, I see that the locality level of my tasks always seems to be set to PROCESS_LOCAL. However, I don't quite understand how such data locality can be achieved in a cloud environment.

Is Azure actually moving my code close to the data, as can be done with regular HDFS, or is the locality level simply reported as PROCESS_LOCAL while the data is in reality being loaded over the network?

In other words, is Azure somehow provisioning the HDInsight worker nodes into proximity of the data I have in the ADLS or what is the explanation for the locality level I see in Spark UI?


1 Answer

2 votes

First of all, PROCESS_LOCAL is the optimal locality level. It means that a task runs in the same JVM process that already holds its data, so those tasks do not need to pull data across the network from other workers, which lets Spark execute your job very quickly. It also suggests that the resources of your Azure cluster are sufficient to load all the partitions of your dataset and process them within the same executor processes.
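
As a rough illustration of when PROCESS_LOCAL shows up, here is a minimal Spark (Scala) sketch. The ADLS path, account name, and column name are hypothetical placeholders, and it assumes an HDInsight cluster where the ADLS connector is already configured; it is not taken from your job.

```scala
import org.apache.spark.sql.SparkSession

object LocalityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("locality-demo")
      .getOrCreate()

    // Reading straight from ADLS: the data arrives over the network, so Spark
    // has no host-level locality information for the initial scan tasks.
    // (adl://myaccount.azuredatalakestore.net/data/events is a placeholder path.)
    val df = spark.read.parquet("adl://myaccount.azuredatalakestore.net/data/events")

    // After caching, the partitions live inside the executor JVMs, so tasks of
    // later actions can be scheduled PROCESS_LOCAL (same process as the data).
    df.cache()
    df.count() // materializes the cache

    // Inspect the "Locality Level" column for this job in the Spark UI;
    // "someColumn" is just an illustrative column name.
    df.groupBy("someColumn").count().show()

    spark.stop()
  }
}
```

In other words, PROCESS_LOCAL for the later stages reflects where the partitions ended up (in the executors' memory), not that Azure physically moved the HDInsight nodes closer to the data in ADLS.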

Useful resources on locality:

http://www.waitingforcode.com/apache-spark/spark-data-locality/read

http://www.russellspitzer.com/2017/09/01/Spark-Locality/