3 votes

If I have 3 Spark applications all using the same YARN cluster, how should I set

yarn.nodemanager.resource.cpu-vcores

in each of the 3 yarn-site.xml files?

(each Spark application is required to have its own yarn-site.xml on the classpath)

Does this value even matter in the clients' yarn-site.xml files?

If it does:

Let's say the cluster has 16 cores.

Should the value in each yarn-site.xml be 5 (for a total of 15, leaving 1 core for system processes)? Or should I set each one to 15?

(Note: Cloudera indicates one core should be left for system processes here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/; however, they do not go into detail about running multiple clients against the same cluster.)

Assume Spark is running with YARN as the master, in cluster mode.


1 Answer

1 vote

Are you talking about the server-side configuration for each YARN NodeManager? If so, it would typically be set to a little less than the number of CPU cores (or virtual cores, if you have hyperthreading) on each node in the cluster. So if you have 4 nodes with 4 cores each, you could dedicate, for example, 3 per node to the NodeManager, and your cluster would have a total of 12 virtual cores.
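For illustration, here is a minimal server-side yarn-site.xml sketch for one node in that 4-core example (the memory property and its value are assumptions added for completeness, not something from the question):

    <?xml version="1.0"?>
    <!-- yarn-site.xml on each NodeManager host (4-core node example above) -->
    <configuration>
      <property>
        <!-- vcores this NodeManager offers to YARN: 3 of 4, leaving 1 for the OS -->
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>3</value>
      </property>
      <property>
        <!-- memory is usually sized alongside vcores; 6 GB is an assumed figure -->
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>6144</value>
      </property>
    </configuration>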

Then you request the desired resources when submitting the Spark job to the cluster (see http://spark.apache.org/docs/latest/submitting-applications.html for examples), and YARN will attempt to fulfill that request. If it cannot be fulfilled, your Spark job (or application) will be queued up, or it will eventually time out.
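As a sketch of such a request (the JAR name and the numbers are placeholders), a cluster-mode submission against the 12-vcore example above might look like:

    # 5 executors x 2 cores = 10 vcores, plus 1 vcore for the driver = 11 of 12
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-cores 1 \
      --driver-memory 2g \
      --num-executors 5 \
      --executor-cores 2 \
      --executor-memory 3g \
      my-spark-app.jar

If YARN cannot allocate containers for all of that, the application sits in the ACCEPTED state until resources free up or it eventually times out.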

You can also configure different resource pools (queues) in YARN to guarantee a specific amount of memory/CPU resources to each pool, but that's a little bit more advanced.
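As one hedged example of that (the queue names and limits below are made up), the Fair Scheduler reads a fair-scheduler.xml allocations file in which you can guarantee and cap vcores per queue, and each Spark application then submits with --queue:

    <?xml version="1.0"?>
    <!-- fair-scheduler.xml: one queue per Spark application; names are hypothetical -->
    <allocations>
      <queue name="spark_app1">
        <minResources>4096 mb,2 vcores</minResources>
        <maxResources>8192 mb,5 vcores</maxResources>
      </queue>
      <queue name="spark_app2">
        <minResources>4096 mb,2 vcores</minResources>
        <maxResources>8192 mb,5 vcores</maxResources>
      </queue>
    </allocations>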

If you submit your Spark application in cluster mode, keep in mind that the Spark driver will run on a cluster node rather than on your local machine (the one that submitted the job), so it will require at least 1 additional virtual CPU.
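To make that concrete with the numbers from the question (purely illustrative): if the 16-core cluster exposes 15 vcores to YARN, an application asking for seven 2-core executors needs 14 vcores for executors plus at least 1 for the cluster-mode driver, i.e. all 15, so a second application submitted at the same time would have to wait.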

Hope that clarifies things a little for you.