YARN with the capacity scheduler takes only memory into account when allocating resources for user requests. If I submit a Spark job with "--master yarn --deploy-mode client --driver-memory 4g --executor-memory 4g --num-executors 1 --executor-cores 3", YARN allocates an executor with 4 GB of memory and 1 vCore, yet that executor runs 3 tasks in parallel.
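From what I understand, this memory-only behavior comes from the capacity scheduler's default DefaultResourceCalculator. My assumption is that switching the calculator in capacity-scheduler.xml is what makes YARN count vCores as well:

```
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
```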
Is that single core alone executing all of the tasks, 3 at a time?
So if I enable CPU scheduling and CGroups (on my HDP cluster), will YARN assign 3 vCores, and will each of those 3 parallel tasks run on its own core? Will that really improve the processing time?
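If I read the Hadoop docs right, enabling this means running containers under the LinuxContainerExecutor with the CGroups resource handler, with yarn-site.xml settings roughly like these (the hierarchy, group, and mount path are from my setup and may differ):

```
yarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
yarn.nodemanager.linux-container-executor.group=hadoop
yarn.nodemanager.linux-container-executor.resources-handler.class=org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler
yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn
yarn.nodemanager.linux-container-executor.cgroups.mount-path=/sys/fs/cgroup
```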
For now, I have not been able to enable CPU scheduling in my cluster (HDP 2.6.5, CentOS 7.5), because the NodeManager fails to start with the error "Not able to enforce cpu weights; cannot write to cgroup at: /sys/fs/cgroup/cpu,cpuacct".
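From what I've read, this error suggests the NodeManager (running as the yarn user) has no write access under the mounted cpu controller, so pre-creating the hierarchy and handing it over to yarn might be a workaround. Is something like this the right approach? (The hadoop-yarn hierarchy name and yarn:hadoop ownership are assumptions based on my setup.)

```
# pre-create YARN's hierarchy under the cpu controller and let the yarn user write to it
sudo mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn
sudo chown -R yarn:hadoop /sys/fs/cgroup/cpu/hadoop-yarn
```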