> My understanding is that this application should occupy 5 cores in the cluster (4 executor cores and 1 driver core), but I don't observe this in the RM and Spark UIs.

That would be the perfect situation, in which YARN gives you all 5 cores off the CPUs it manages.
Since the perfect situation does not occur often, it's nice to grab as many cores as YARN will give so the Spark application can start at all. Spark could just wait indefinitely for all the requested cores, but that might not always be to your liking, could it?
That's why Spark on YARN has an extra check (aka `minRegisteredRatio`): by default it waits until at least 80% of the requested resources have registered before the application starts executing tasks. You can use the `spark.scheduler.minRegisteredResourcesRatio` Spark property to control the ratio. That would explain why you see fewer cores in use than requested.
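As an illustration (the executor counts and timeout value here are made-up examples, not recommendations), you could tighten the check at submit time so scheduling waits for all requested executors, bounded by the waiting-time property:

```shell
# Illustrative spark-submit: require 100% of the requested executors to
# register before scheduling begins, but give up waiting after 60 seconds.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 1 \
  --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=60s \
  app.jar
```

The same two properties can also be set in `spark-defaults.conf` or on a `SparkConf` before the context is created.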
Quoting the official Spark documentation (highlighting mine):
> `spark.scheduler.minRegisteredResourcesRatio`
>
> Default: 0.8 for YARN mode
>
> The minimum ratio of registered resources (registered resources / total expected resources) (**resources are executors in yarn mode**, CPU cores in standalone mode and Mesos coarse-grained mode ['spark.cores.max' value is total expected resources for Mesos coarse-grained mode]) to wait for before scheduling begins. Specified as a double between 0.0 and 1.0. Regardless of whether the minimum ratio of resources has been reached, the maximum amount of time it will wait before scheduling begins is controlled by config `spark.scheduler.maxRegisteredResourcesWaitingTime`.
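To make the quoted rule concrete, here is a small illustrative calculation (plain Python, not a Spark API; the function name is mine) of how many executors must register before scheduling may begin. In YARN mode the registered resources are executors, and Spark's check is simply `registered >= expected * ratio`:

```python
import math

def executors_needed(total_expected: int, min_ratio: float = 0.8) -> int:
    """Smallest number of registered executors satisfying
    registered >= total_expected * min_ratio (YARN counts executors
    as the registered resources; 0.8 is the YARN-mode default)."""
    return math.ceil(total_expected * min_ratio)

# With 4 requested executors and the default ratio of 0.8,
# 4 * 0.8 = 3.2, so all 4 executors must register before scheduling starts.
print(executors_needed(4))        # 4
print(executors_needed(10, 0.8))  # 8
```

Note how with small executor counts the default 0.8 effectively means "all of them", which is why `spark.scheduler.maxRegisteredResourcesWaitingTime` matters as the escape hatch.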