
I am facing an issue when submitting a Spark job jar on YARN. It works well and gives me the expected result when I submit it with --master yarn-client.

The command is as follows:

./spark-submit --class main.MainClass --master yarn-client --driver-memory 4g --executor-memory 4g --num-executors 4 --executor-cores 2 job.jar other-options

But the same job is not working when submitted in cluster mode; the command is as follows:

./spark-submit --class main.MainClass --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 4g --num-executors 4 --executor-cores 2 job.jar other-options

My output when submitting in cluster mode is shown below.

My yarn-site.xml is as follows:

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>20048</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>24096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
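
Since the log below shows Spark asking for 4505 MB containers (a 4 GB executor plus overhead), here is a rough sanity check against the limits above. It assumes Spark's default executor memory overhead of max(384 MB, 10% of the executor memory), so treat it as illustrative only:

EXECUTOR_MB=4096                                                    # --executor-memory 4g
OVERHEAD_MB=$(( EXECUTOR_MB / 10 > 384 ? EXECUTOR_MB / 10 : 384 ))  # default overhead unless spark.yarn.executor.memoryOverhead is set
echo "per-container request: $(( EXECUTOR_MB + OVERHEAD_MB )) MB"   # prints 4505, matching the YarnAllocator line in the log
# 4505 MB fits under yarn.scheduler.maximum-allocation-mb (20048) and yarn.nodemanager.resource.memory-mb (24096),
# so the container requests themselves look satisfiable.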

My YARN stderr log is:

    17/03/23 03:30:44 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3315fed4{/static,null,AVAILABLE}
    17/03/23 03:30:44 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e430b9a{/,null,AVAILABLE}
    17/03/23 03:30:44 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@77184f65{/api,null,AVAILABLE}
    17/03/23 03:30:44 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@643f7b84{/stages/stage/kill,null,AVAILABLE}
    17/03/23 03:30:44 INFO server.ServerConnector: Started ServerConnector@27614db2{HTTP/1.1}{0.0.0.0:37212}
    17/03/23 03:30:44 INFO server.Server: Started @7799ms
    17/03/23 03:30:44 INFO util.Utils: Successfully started service 'SparkUI' on port 37212.
    17/03/23 03:30:44 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://50.31.66.56:37212
    17/03/23 03:30:44 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
    17/03/23 03:30:44 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1490254182417_0001 and attemptId Some(appattempt_1490254182417_0001_000001)
    17/03/23 03:30:44 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45469.
    17/03/23 03:30:44 INFO netty.NettyBlockTransferService: Server created on 50.31.66.56:45469
    17/03/23 03:30:44 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 50.31.66.56, 45469)
    17/03/23 03:30:44 INFO storage.BlockManagerMasterEndpoint: Registering block manager 50.31.66.56:45469 with 2004.6 MB RAM, BlockManagerId(driver, 50.31.66.56, 45469)
    17/03/23 03:30:44 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 50.31.66.56, 45469)
    17/03/23 03:30:44 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@60245f4e{/metrics/json,null,AVAILABLE}
    17/03/23 03:30:49 INFO scheduler.EventLoggingListener: Logging events to hdfs://mecku-1:54310/spark/application_1490254182417_0001_1
    17/03/23 03:30:49 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://[email protected]:50465)
    17/03/23 03:30:49 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8030
    17/03/23 03:30:49 INFO yarn.YarnRMClient: Registering the ApplicationMaster
    17/03/23 03:30:49 INFO yarn.YarnAllocator: Will request 4 executor containers, each with 2 cores and 4505 MB memory including 409 MB overhead
    17/03/23 03:30:49 INFO yarn.YarnAllocator: Canceled 0 container requests (locality no longer needed)
    17/03/23 03:30:49 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:4505, vCores:2>)
    17/03/23 03:30:49 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:4505, vCores:2>)
    17/03/23 03:30:49 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:4505, vCores:2>)
    17/03/23 03:30:49 INFO yarn.YarnAllocator: Submitted container request (host: Any, capability: <memory:4505, vCores:2>)
    17/03/23 03:30:49 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
    17/03/23 03:30:49 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
    17/03/23 03:30:49 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
    17/03/23 03:30:49 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://localhost:54310/user/root/.sparkStaging/application_1490254182417_0001
    17/03/23 03:30:49 INFO storage.DiskBlockManager: Shutdown hook called
    17/03/23 03:30:49 INFO util.ShutdownHookManager: Shutdown hook called
    17/03/23 03:30:49 INFO util.ShutdownHookManager: Deleting directory /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1490254182417_0001/spark-d77de654-4040-4b43-8155-efb155008b4b
    17/03/23 03:30:49 INFO util.ShutdownHookManager: Deleting directory /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1490254182417_0001/spark-d77de654-4040-4b43-8155-efb155008b4b/userFiles-d71596df-df26-4b88-b51e-f0b962daf84a
    17/03/23 03:30:40 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1490254182417_0001_000001
    17/03/23 03:30:40 INFO spark.SecurityManager: Changing view acls to: root
    17/03/23 03:30:40 INFO spark.SecurityManager: Changing modify acls to: ro

But after all of this my Spark job does not actually run, and as you can see, no error is shown here. Any idea what is behind this issue?
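
For reference, in cluster mode the driver runs inside the YARN ApplicationMaster container, so driver-side stdout/stderr end up in the container logs rather than on the submitting console; with log aggregation enabled they can be fetched with something like:

yarn logs -applicationId application_1490254182417_0001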

Please also add the stdout log for your Spark job. – FaigB
Can you try --master yarn-cluster instead of --master yarn, also remove --deploy-mode cluster, and check? – mbaxi
Also, can you specify the version of Spark and the Hadoop distribution you are using? – mbaxi
@FaigB In stdout I only have the output that I write using System.out.println in my Java code. – KOUSIK MANDAL
@mbaxi I am using Spark 2.1 and Hadoop 2.7.3. – KOUSIK MANDAL

1 Answer


Maybe your slave nodes aren't working. You should check your nodes with the command below:

sudo -u yarn yarn node -list

If you can't see all of your nodes in the output, you should fix the settings on those nodes: for example, make sure SELinux is off (check with getenforce), and review each node's yarn-site.xml and core-site.xml.
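
A rough sketch of those checks, assuming standard SELinux tooling and a Hadoop version whose yarn node command supports the -all flag (exact output will vary):

getenforce                          # should print Permissive or Disabled
sudo setenforce 0                   # temporarily stop enforcement if it prints Enforcing
sudo -u yarn yarn node -list -all   # -all also lists nodes that are not in the RUNNING state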