Spark: hdfs cluster mode

Question

I'm just getting started with Apache Spark. I'm running in cluster mode (master, slave1, slave2) and I want to process a big file stored in Hadoop (HDFS). I read it with the textFile method from SparkContext. While the file is being processed I monitor the nodes, and I can see that only slave2 is working. After processing, slave2 has tasks but slave1 has none. If I use a local file instead of HDFS, both slaves work simultaneously. I don't understand this behaviour. Can anybody give me a clue?

mgaido · Accepted Answer · 2016-06-02T12:52:49

The main reason for this behavior is data locality. When Spark's Application Master requests new executors, it tries to have them allocated on the nodes where the data resides.

In your case, HDFS has likely written all the blocks of the file to the same node, so Spark instantiates the executors on that node. If you use a local file instead, it is present on every node, so data locality is no longer an issue.
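To make the locality argument concrete, here is a toy model in plain Python. It is not Spark's scheduler and uses no Spark API: `assign_tasks` is a hypothetical helper that places one task per block on a worker holding a replica of that block, falling back to any worker otherwise. The block layouts are assumptions chosen to mirror the two situations described in the question.

```python
def assign_tasks(block_locations, workers):
    """Assign one task per block, preferring a worker that stores the block.

    block_locations: list of sets, one per block, naming the nodes with a replica.
    workers: list of worker node names.
    Toy model only: a real scheduler also considers free cores, timeouts, etc.
    """
    load = {w: 0 for w in workers}
    for replicas in block_locations:
        # Workers that can read this block locally
        local = [w for w in workers if w in replicas]
        # Fall back to any worker when no replica is local
        candidates = local or workers
        # Pick the least-loaded candidate (ties go to the first one listed)
        target = min(candidates, key=lambda w: load[w])
        load[target] += 1
    return load


workers = ["slave1", "slave2"]

# Assumed layout: every block of the file lives only on slave2
skewed = [{"slave2"}] * 4
print(assign_tasks(skewed, workers))      # {'slave1': 0, 'slave2': 4}

# Assumed layout: a copy of the data on every node, like a local file
everywhere = [{"slave1", "slave2"}] * 4
print(assign_tasks(everywhere, workers))  # {'slave1': 2, 'slave2': 2}
```

With the skewed layout all four tasks land on slave2, which matches what you observed. Once every node holds the data, the same rule spreads the tasks evenly.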
