1 vote

I was running an application on AWS EMR Spark. Here is the spark-submit job:

Arguments : spark-submit --deploy-mode cluster --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08

AWS uses YARN for resource management. I was looking at the metrics (screenshot below), and have a doubt regarding the YARN 'container' metrics.

[Screenshot of the YARN metrics]

Here, the number of containers allocated is shown as 2. However, I was using 4 nodes (3 slaves + 1 master), each with 8 CPU cores. So, how were only 2 containers allocated?

Comments:

How much memory and how many cores are you allocating for the job? Remember that one of your nodes runs the Application Master, which reserves 1 core on that node. Thus, if you are attempting to assign 8 cores to each executor, it will only launch 2 executors. – Glennie Helles Sindholt

The Application Master runs on the master node, right? Also, I had not specified any memory or cores, so it must be picking up some defaults. – Sanchay

No, the AM runs on a slave node, not on the driver. Have you remembered to adjust capacity-scheduler.xml with this setting: "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"? And you need to specify the number of executors you want; YARN will not automatically utilize your entire cluster. – Glennie Helles Sindholt

1 Answer

2 votes

There are a couple of things you need to do. First of all, you need to set the following configuration in capacity-scheduler.xml:

"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"

otherwise YARN will not use all the cores you specify. Secondly, you need to actually specify the number of executors you want, the number of cores per executor, and the amount of memory to allocate to each executor (and possibly to the driver as well, if you have many shuffle partitions or if you collect data to the driver).
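As a sketch, the submit command from the question could be extended with explicit resource flags; the values below are illustrative placeholders for a cluster with three 8-core slaves, not tuned recommendations, and they still have to fit within the memory and vcores YARN exposes on each node:

spark-submit --deploy-mode cluster \
  --class com.amazon.JavaSparkPi \
  --num-executors 6 \
  --executor-cores 3 \
  --executor-memory 4g \
  --driver-memory 4g \
  s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar \
  s3://spark-config-test/2017-08-08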

YARN is designed to manage clusters running many different jobs at the same time, so by default it will not assign all resources to a single job unless you force it to with the above-mentioned setting. Furthermore, the default settings for Spark are also not sufficient for most jobs, so you need to set them explicitly. Please have a read through this blog post to get a better understanding of how to tune Spark settings for optimal performance.
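To make the sizing concrete, a rough back-of-the-envelope for the cluster in the question (3 slaves with 8 cores each; the per-node memory is an assumption since the instance type isn't given) might look like this:

3 slaves x 8 cores                        = 24 cores in total
minus ~1 core per node for OS/YARN        ≈ 21 usable cores
minus 1 core for the ApplicationMaster    ≈ 20 cores for executors
at 3 cores per executor                   ≈ 6 executors
executor memory ≈ (YARN memory per node / executors per node) minus ~10% for overhead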