I am quite new to the Spark world. Our application has a built-in Spark standalone cluster (version 2.4.3) that accepts jobs submitted by our main data engine loader application via spark-submit against the master URL.

We have 3 worker nodes on different VMs. Because of an IOException (posted below in a limited, redacted form to avoid exposing system internals), the Master assumes it needs to resubmit the same job/application to the same worker over and over again (tens of thousands of times).

Worker app/job logs, which are the same for every job resubmission:

2020-04-28 11:31:15,466 INFO spark.SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(app_prod); groups with view permissions: Set(); users with modify permissions: Set(app_prod); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:64)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
... 4 more
Caused by: java.io.IOException: Failed to connect to load.box.ancestor.com/xx.xxx.xx.xxx:30xxx
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: load.box.ancestor.com/xx.xxx.xx.xxx:30xxx
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)

Below are the Master logs, which show it resubmitting the same job over and over, even though by the look of it the worker job/app is exiting with EXIT(1).

Spark Master job Logs:

2020-04-28 11:30:49,750 INFO master.Master: Launching executor app-27789323082123-23782/11635 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:52,195 INFO master.Master: Removing executor app-27789323082123-23782/11635 because it is EXITED
2020-04-28 11:30:52,195 INFO master.Master: Launching executor app-27789323082123-23782/11636 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:54,651 INFO master.Master: Removing executor app-27789323082123-23782/11636 because it is EXITED
2020-04-28 11:30:54,651 INFO master.Master: Launching executor app-27789323082123-23782/11637 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:57,201 INFO master.Master: Removing executor app-27789323082123-23782/11637 because it is EXITED
2020-04-28 11:30:57,201 INFO master.Master: Launching executor app-27789323082123-23782/11638 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
2020-04-28 11:30:59,769 INFO master.Master: Removing executor app-27789323082123-23782/11638 because it is EXITED
2020-04-28 11:30:59,769 INFO master.Master: Launching executor app-27789323082123-23782/11639 on worker worker-29990224233349-yy.yy.yyy.yyy-7078
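
To put a number on the launch/remove cycle shown in the Master log above, the lines can be tallied per application and worker. This is only a minimal sketch: the regular expressions are written against the log format pasted here, and the names `LAUNCH`, `REMOVE`, `SAMPLE` and `tally` are hypothetical helpers, not part of Spark.

```python
import re
from collections import Counter

# Patterns written against the Master log lines pasted in this post (an assumption
# about the format; adjust if your log layout differs).
LAUNCH = re.compile(r"Launching executor (app-[\w-]+)/(\d+) on worker (\S+)")
REMOVE = re.compile(r"Removing executor (app-[\w-]+)/(\d+) because it is (\w+)")

# Three lines copied from the Master log above.
SAMPLE = [
    "2020-04-28 11:30:49,750 INFO master.Master: Launching executor app-27789323082123-23782/11635 on worker worker-29990224233349-yy.yy.yyy.yyy-7078",
    "2020-04-28 11:30:52,195 INFO master.Master: Removing executor app-27789323082123-23782/11635 because it is EXITED",
    "2020-04-28 11:30:52,195 INFO master.Master: Launching executor app-27789323082123-23782/11636 on worker worker-29990224233349-yy.yy.yyy.yyy-7078",
]


def tally(lines):
    """Count executor launches per (app, worker) and removals per (app, state)."""
    launches = Counter()
    removals = Counter()
    for line in lines:
        match = LAUNCH.search(line)
        if match:
            launches[(match.group(1), match.group(3))] += 1
            continue
        match = REMOVE.search(line)
        if match:
            removals[(match.group(1), match.group(3))] += 1
    return launches, removals


if __name__ == "__main__":
    launches, removals = tally(SAMPLE)
    for key, count in launches.items():
        print("launched", key, count)
    for key, count in removals.items():
        print("removed", key, count)
```

Running `tally` over the full Master log file (for example `tally(open(path))`, with `path` pointing at your own log) gives the number of relaunches per application and worker.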

My question: we haven't modified spark.deploy.maxExecutorRetries, so it should be at its default of 10.

Is this error, or the repeated resubmission, affected by that parameter, or do we need to check another config, in case the Spark master is not able to recognize that the worker job failed?

2 Answers

0
votes

Try setting the config below:

spark.task.maxFailures = 2
0
votes

We noticed that spark.port.maxRetries was set to x. Since each job takes both a spark.driver.port and a spark.driver.blockManager.port, this left only x/2 ports available for spark.driver.port. For the next queued job, this parameter prevented any further port from being assigned, and the job kept being resubmitted with the same port. The only option we could come up with was to set spark.port.maxRetries to a larger number.
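
The x/2 arithmetic can be illustrated with a small simulation. It is a rough sketch, not Spark's actual binding code: it assumes both ports start from the same base port and that each bind attempt walks upward from the base for at most spark.port.maxRetries extra ports. The function names and the base port 30000 are made up for illustration.

```python
def bind(start_port, max_retries, in_use):
    """Try start_port, start_port + 1, ... up to max_retries extra ports.

    Returns the first free port (and marks it used), or None if every
    candidate in the range is already taken.
    """
    for offset in range(max_retries + 1):
        candidate = start_port + offset
        if candidate not in in_use:
            in_use.add(candidate)
            return candidate
    return None


def jobs_that_fit(start_port, max_retries, queued_jobs):
    """Count how many queued jobs obtain both a driver port and a block manager port."""
    in_use = set()
    started = 0
    for _ in range(queued_jobs):
        driver_port = bind(start_port, max_retries, in_use)
        block_manager_port = bind(start_port, max_retries, in_use)
        if driver_port is None or block_manager_port is None:
            break  # the port range is exhausted; this job cannot bind
        started += 1
    return started


if __name__ == "__main__":
    # Hypothetical values: base port 30000, ten queued jobs.
    for retries in (3, 16, 99):
        print(retries, jobs_that_fit(30000, retries, 10))
```

Under these assumptions each job consumes two ports from the retry range, so roughly half the range is available per port type, and a larger spark.port.maxRetries lets more concurrent jobs bind.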