When `FileSystem.open(path)` is executed inside a Spark task, the file content is loaded into a local variable in the same JVM process, and that data is what backs the resulting RDD partition(s). So the data locality for that RDD is always PROCESS_LOCAL, as vanekjar has already pointed out in a comment on the question.
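A minimal sketch of the pattern being described, assuming an `RDD[String]` of HDFS paths and the Hadoop `FileSystem` API (the paths and app name are hypothetical, just for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import scala.io.Source

val spark = SparkSession.builder.appName("manual-open").getOrCreate()
val sc = spark.sparkContext

// Paths distributed as an RDD[String]; each task opens its file by hand.
val paths = sc.parallelize(Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")) // hypothetical paths

val contents = paths.mapPartitions { iter =>
  // FileSystem handle created inside the task, i.e. in the executor JVM
  val fs = FileSystem.get(new Configuration())
  iter.flatMap { p =>
    val in = fs.open(new Path(p))                    // file bytes are pulled into this JVM
    try Source.fromInputStream(in).getLines().toList // materialize before closing the stream
    finally in.close()
  }
}
// Any RDD derived from `contents` is built from data that already sits in the
// executor's JVM, so its tasks are reported as PROCESS_LOCAL.
```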
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest:
- PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
- NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
- NO_PREF data is accessed equally quickly from anywhere and has no locality preference
- RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
- ANY data is elsewhere on the network and not in the same rack
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
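How long the scheduler waits before falling back to a lower level is tunable via `spark.locality.wait`, with per-level overrides such as `spark.locality.wait.node` and `spark.locality.wait.rack`. A short configuration sketch (the app name is just an example):

```scala
import org.apache.spark.sql.SparkSession

// spark.locality.wait sets how long the scheduler waits for a free slot at the
// preferred locality level before downgrading to the next one (default 3s).
val spark = SparkSession.builder
  .appName("locality-tuning")
  .config("spark.locality.wait", "3s")      // global wait before downgrading
  .config("spark.locality.wait.node", "3s") // override for NODE_LOCAL waits
  .getOrCreate()
```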
The locality level you see refers to the data locality of the original `RDD[String]`. By calling `FileSystem.open(path)` you are not creating a new RDD. Why don't you get Spark to load all the files as an RDD instead of opening them manually? – vanekjar
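For comparison, a sketch of the alternative vanekjar suggests: let Spark read the files itself, so the scheduler knows where the HDFS blocks live and can aim for NODE_LOCAL or better placement. The directory paths here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("spark-managed-read").getOrCreate()
val sc = spark.sparkContext

// One record per line across all matching files; Spark schedules tasks
// close to the blocks that hold the data.
val lines = sc.textFile("hdfs:///data/input/*.txt")

// Or one (path, wholeFileContent) pair per file, if whole-file records are needed.
val files = sc.wholeTextFiles("hdfs:///data/input")
```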