Yarn Cluster optimization for Spark

Question

I trying to configure Yarn and Spark for my 4 node cluster.
Every node has the following specs:

24 cores
23.5 GB RAM
swap off

I configured Yarn and Spark so far that Spark can execute the SparkPi example calculation, but this works only under the following configuration of the yarn-site.xml:

<configuration>
<property>
        <name>yarn.acl.enable</name>
        <value>0</value>
</property>

<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>ds11</value>
</property>

<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>20480</value>
</property>

<property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>20480</value>
</property>

<property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1536</value>
</property>

<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>

<property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
</property>

<property>
        <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
        <value>3600</value>
</property>

And under the following spark-defaults.conf:

spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://ds11:9000/spark-logs
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2048m
spark.executor.memory            1024m
spark.yarn.am.memory             1024m
spark.executor.instances         16
spark.executor.cores             4

spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs://ds11:9000/spark-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080

The critical points are:

yarn.scheduler.minimum-allocation-mb

and

spark.executor.memory

If I set the yarn.scheduler.minimum-allocation-mb to just 1537mb or higher, then Spark can't allocate containers for the Spark Jobs.
So, when I start Spark I get the following Diagnostics:

2018-03-01 13:12:25,295 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2018-03-01 13:12:25,296 INFO yarn.Client: Setting up container launch context for our AM
2018-03-01 13:12:25,299 INFO yarn.Client: Setting up the launch environment for our AM container
2018-03-01 13:12:25,306 INFO yarn.Client: Preparing resources for our AM container
2018-03-01 13:12:26,722 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-03-01 13:12:29,899 INFO yarn.Client: Uploading resource file:/tmp/spark-19cf3747-6949-4117-ba92-ccde71d8b473/__spark_libs__7526053733120768643.zip -> hdfs://ds11:9000/user/nw/.sparkStaging/application_1519906323717_0001/__spark_libs__7526053733120768643.zip
2018-03-01 13:12:32,082 INFO yarn.Client: Uploading resource file:/tmp/spark-19cf3747-6949-4117-ba92-ccde71d8b473/__spark_conf__171844339516087904.zip -> hdfs://ds11:9000/user/nw/.sparkStaging/application_1519906323717_0001/__spark_conf__.zip
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing view acls to: nw
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing modify acls to: nw
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing view acls groups to: 
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing modify acls groups to: 
2018-03-01 13:12:32,167 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(nw); groups with view permissions: Set(); users  with modify permissions: Set(nw); groups with modify permissions: Set()
2018-03-01 13:12:32,175 INFO yarn.Client: Submitting application application_1519906323717_0001 to ResourceManager
2018-03-01 13:12:32,761 INFO impl.YarnClientImpl: Submitted application application_1519906323717_0001
2018-03-01 13:12:32,766 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1519906323717_0001 and attemptId None
2018-03-01 13:12:33,779 INFO yarn.Client: Application report for application_1519906323717_0001 (state: ACCEPTED)
2018-03-01 13:12:33,785 INFO yarn.Client: 
 client token: N/A
 diagnostics: [Thu Mar 01 13:12:32 +0100 2018] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1537, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; 
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1519906352464
 final status: UNDEFINED
 tracking URL: http://ds11:8088/proxy/application_1519906323717_0001/
 user: nw
2018-03-01 13:12:34,789 INFO yarn.Client: Application report for application_1519906323717_0001 (state: ACCEPTED)
2018-03-01 13:12:35,794 INFO yarn.Client: Application report for application_1519906323717_0001 (state: ACCEPTED)

When I have the yarn.scheduler.minimum-allocation-mb on 1536mb and increase the spark.executor.memory to e.g 2048mb, I get the following Error:

2018-03-01 15:15:47,578 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (2048+384 MB) is above the max threshold (1536 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:319)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

When I increase both parameter I get still the first errortype, that Spark can't allocate container.

Maybe someone has an idea for this problem?

Did you restart the node manager and resource manager after changing the xml? — OneCricketeer
And have you copied the same xml file to all the YARN nodes? Editing the xml on the Spark client doesn't do anything — OneCricketeer
You'll also want to run jps on all 4 machines to verify a NodeManager process is running — OneCricketeer

OneCricketeer OneCricketeer · Accepted Answer · 2018-03-01T17:49:47

It sounds like you are only editing the yarn-site on the Spark client only.

If you want to change the actual YARN ResourceManager and NodeManager memory sizes, then you'll need to rsync that file across the whole cluster, then reboot the YARN services.

P.S. Setup HA ResourceManager if you don't have it already

Yarn Cluster optimization for Spark

1 Answers