
I'm using Spark SQL on top of HDFS.
Every HDFS node has a Spark worker running.
When I run a large query, HDFS seems to be shipping data between nodes to the Spark workers.
Why isn't HDFS serving each local Spark worker with its local data?
All tasks show a locality level of ANY, even though I set spark.locality.wait=10000.
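
For reference, the setting was applied roughly like this (a minimal sketch; the app name and the rest of the SparkConf wiring are placeholders, not my exact code):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-test")            // placeholder app name
  .set("spark.locality.wait", "10000")    // wait up to ~10 s before falling back to a less local level
val sc = new SparkContext(conf)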

Anything I'm missing or need to look at?

Thanks,


1 Answer


Spark has to ask YARN for executors before it runs any jobs, so YARN allocates the containers for those executors without knowing where the data is. To fix this, you need to tell Spark which files you're going to read when you create the SparkContext, like this (assuming you're using Scala):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.InputFormatInfo

// compute the preferred (data-local) hosts for the input file
val locData = InputFormatInfo.computePreferredLocations(
    Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))

val sc = new SparkContext(conf, locData)  // containers are requested on the preferred hosts
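
Once the context is created this way, reading the same path should schedule tasks at NODE_LOCAL (or better) instead of ANY. A quick sanity check, using the placeholder path from above:

val lines = sc.textFile("myfile.txt")
println(lines.count())   // then check the Locality Level column for this stage in the Spark UI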