I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters specify the number of executors while also being able to specify how many cores should be assigned to them.
The Dataproc image has the following YARN and Spark defaults:
- Spark dynamic allocation enabled
- YARN Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the container-to-vCore ratio is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by setting spark.dynamicAllocation.enabled=false and spark.executor.instances=<num> as properties in the gcloud submit command).
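For reference, this is roughly what such a submission looks like as a full command. A sketch only: the cluster name, main class and jar location are hypothetical placeholders; the --properties values are the ones that matter here.

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3

With DefaultResourceCalculator this reliably yields 3 executors, each in a 1-vCore container.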
So I changed the resource calculator to DominantResourceCalculator, and now the requested cores are taken into account, but I'm no longer able to specify the number of executors, regardless of whether Spark dynamic allocation is disabled or not.
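To illustrate the problem, a submission of this shape (same hypothetical placeholders as above; the property values are just examples) now only half-works:

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.dynamicAllocation.enabled=false,spark.executor.cores=2,spark.executor.memory=5g,spark.executor.instances=3

The 2 cores per executor are honored, but the 3 requested executor instances are not.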
It might also be of interest to know that the default YARN queue is limited by configuration (in capacity-scheduler.xml): it is assigned 30% of capacity, with the remaining 70% going to another, not-yet-used queue. My understanding is that neither the Capacity nor the Fair scheduler limits resource allocation for uncontended job submissions as long as the maximum capacity is kept at 100. In any case, for the sake of clarity, these are the properties set during cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
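For completeness, a cluster-creation sketch that passes these properties. The cluster name is a hypothetical placeholder; the worker counts match the setup described above, and the alternate-delimiter syntax (see gcloud topic escaping) is needed because the queues value itself contains a comma:

gcloud dataproc clusters create my-cluster \
    --num-workers=2 \
    --num-preemptible-workers=1 \
    --properties='^#^capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator#capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online#capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30#capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70#capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1#capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100#capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING#capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*#capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*'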
The job submission is done by means of the gcloud tool, and the queue used is the default one. For example, the following properties, set when executing gcloud dataproc jobs submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in the following assignment:

[screenshot: YARN ResourceManager UI showing the resulting allocation]
Is there a way to configure YARN so that it honors both the number of executors and the cores requested for them?
EDITED to specify queue setup
Did you set DominantResourceCalculator at cluster-creation time in capacity-scheduler? Did you change anything else in YARN or capacity-scheduler configs? You say it's a cluster of 2 workers, but the screenshot seems to show 3 "Active Nodes". What machine type are you using? Are you specifying any non-default queues? It's a bit strange that your screenshot shows "120.4% of queue" but only "36.1% of cluster". – Dennis Huo