I'm using Spark with HDFS (Hadoop storage) and YARN. My cluster contains 5 nodes (1 master and 4 slaves).
- Master node: 48 GB RAM, 16 CPU cores
- Slave nodes: 12 GB RAM, 16 CPU cores
I'm running two different jobs: a WordCount and a SparkSQL query, each on a different file. Everything works, but I have some questions; maybe I don't understand Hadoop/Spark very well.
First example: WordCount
I ran the WordCount job and got the result in two files (part-00000 and part-00001). The availability (the replica locations) is slave4 and slave1 for part-00000, and slave3 and slave4 for part-00001.
Why is there no replica on slave2? Is that normal?
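For information, the replica placement can also be checked from the command line; the output path below is a placeholder for my real output directory:

hdfs fsck /user/valentin/wordcount_output -files -blocks -locations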
When I look at the application_ID, I see that only 1 slave did the job.
Why is my task not well distributed across my cluster?
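For context, the WordCount job is essentially the classic PySpark pattern, something like this (a simplified sketch, not the exact contents of SparkCount.py; the HDFS paths and app name are placeholders):

from pyspark import SparkContext

# Simplified sketch; paths and app name are placeholders, not my exact script.
sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/valentin/input.txt")   # read the input file from HDFS
            .flatMap(lambda line: line.split())            # split each line into words
            .map(lambda word: (word, 1))                   # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))              # sum the counts per word

# Each partition of the final RDD becomes one part-XXXXX file in the output directory.
counts.saveAsTextFile("hdfs:///user/valentin/wordcount_output")

sc.stop()

If I understand correctly, the two part-XXXXX files simply correspond to the two partitions of the RDD that gets saved.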
Second example: SparkSQL
In this case there is no output file to save because I just want to return an SQL result, but again only 1 slave node does the work.
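The SparkSQL job is roughly of this form (again a simplified sketch; the input file, view name, and query are placeholders):

from pyspark.sql import SparkSession

# Simplified sketch; the input file, view name, and query are placeholders.
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# Load the second file into a DataFrame and expose it as a temporary view.
df = spark.read.csv("hdfs:///user/valentin/data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("data")

# Run the query and only display the result; nothing is written back to HDFS.
spark.sql("SELECT COUNT(*) AS n FROM data").show()

spark.stop()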
So why does only 1 slave node execute the task while my cluster seems to be working fine?
The command line I use to run the job is:
time ./spark/bin/spark-submit --master yarn --deploy-mode cluster /home/valentin/SparkCount.py
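I don't pass any executor options, so I suppose only the default resources are requested. If the resources have to be requested explicitly, I assume it would look something like this (the flag values below are just examples, not values tuned for my nodes):

time ./spark/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 4G \
  /home/valentin/SparkCount.py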
Thank you!
Comment: […] spark-submit to use more than the default values. – OneCricketeer