I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters specify the number of executors while also being able to specify how many cores should be assigned to them.
The Dataproc image has the following YARN and Spark defaults:
- Spark dynamic allocation enabled
- YARN Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the container-to-vCore ratio is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by setting spark.dynamicAllocation.enabled=false and spark.executor.instances=<num> as properties in the gcloud submit command).
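For reference, this is roughly what such a submission looks like as a full command. A sketch only: the cluster name, main class and jar location are hypothetical placeholders; the --properties values are the ones that matter here.

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3

With DefaultResourceCalculator this reliably yields 3 executors, each in a 1-vCore container.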
So I changed the resource calculator to DominantResourceCalculator, and now the requested cores are taken into account, but I'm no longer able to specify the number of executors, regardless of whether Spark dynamic allocation is disabled or not.
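To illustrate the problem, a submission of this shape (same hypothetical placeholders as above; the property values are just examples) now only half-works:

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.dynamicAllocation.enabled=false,spark.executor.cores=2,spark.executor.memory=5g,spark.executor.instances=3

The 2 cores per executor are honored, but the 3 requested executor instances are not.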
It might also be of interest to know that the default YARN queue is limited by configuration (in capacity-scheduler.xml): it is assigned 30% of capacity, with the remaining 70% going to another, not-yet-used queue. My understanding is that neither the Capacity nor the Fair scheduler limits resource allocation for uncontended job submissions as long as the maximum capacity is kept at 100. In any case, for the sake of clarity, these are the properties set during cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
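For completeness, a cluster-creation sketch that passes these properties. The cluster name is a hypothetical placeholder; the worker counts match the setup described above, and the alternate-delimiter syntax (see gcloud topic escaping) is needed because the queues value itself contains a comma:

gcloud dataproc clusters create my-cluster \
    --num-workers=2 \
    --num-preemptible-workers=1 \
    --properties='^#^capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator#capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online#capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30#capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70#capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1#capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100#capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING#capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*#capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*'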
The job submission is done by means of the gcloud tool, and the queue used is the default one. For example, the following properties, set when executing gcloud dataproc jobs submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in the following assignment:

[screenshot: YARN ResourceManager UI showing the resulting allocation]
Is there a way to configure YARN so that it honors both the number of executors and the cores requested for them?
EDITED to specify queue setup
Did you set DominantResourceCalculator at cluster-creation time in capacity-scheduler? Did you change anything else in YARN or capacity-scheduler configs? You say it's a cluster of 2 workers, but the screenshot seems to show 3 "Active Nodes". What machine type are you using? Are you specifying any non-default queues? It's a bit strange that your screenshot shows "120.4% of queue" but only "36.1% of cluster". – Dennis Huo