49
votes

I have a Spark job which runs fine locally with less data, but when I schedule it on YARN I keep getting the following error; gradually all executors get removed from the UI and my job fails:

15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 6 on myhost2.com: remote Rpc client disassociated

I use the following command to submit the Spark job in yarn-client mode:

 ./spark-submit --class com.xyz.MySpark --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 3g --master yarn-client --executor-memory 2G --executor-cores 8 --num-executors 12  /home/myuser/myspark-1.0.jar

What is the problem here? I am new to Spark.

3
Try increasing executor memory; one of the most common reasons for executor failures is insufficient memory. When an executor consumes more memory than assigned, YARN kills it. The logs you provided give no clue about the reason for the failure. Use "yarn logs -applicationId <yarn application Id>" to check the executor logs (see the sketch after these comments). – banjara
I am seeing this only when we run long-running Spark jobs. If it were a memory issue it should have failed early on. – Bonnie Varghese
Have you figured out how to solve this problem? I observe the same one, with no logs confirming that the executor went out of memory. I only see that the driver killed the executor and that the executor got a SIGTERM signal; after this my application goes through an endless series of stage retries that always fail because a single task fails with FetchFailedException: Executor is not registered. For some reason this type of task failure isn't even retried on a different host; the whole stage is retried. – Dmitriy Sukharev
Use divide and conquer: make your Spark job do fewer things; in my case I split my one Spark job into five different jobs. Make sure you shuffle less data (group by, join, etc.). Make sure you don't cache much data; filter first, then cache if needed, using MEMORY_AND_DISK_SER. If you don't cache much, try reducing spark.storage.fraction from 0.6 to something lower. Use Kryo serialization, and try to use Tungsten (Spark 1.5.1 enables it by default). – unk1102
@shekhar YARN NodeManager logs don't always reveal the reason for the kill. – nir
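
As a concrete way to follow the first comment's advice, here is a minimal sketch of pulling the aggregated executor logs from YARN and searching them for the memory-kill messages YARN typically emits. The application ID is a made-up placeholder and the grep patterns are assumptions based on typical NodeManager output, not taken from the question itself.

 # List recent YARN applications to find the Spark job's application ID.
 yarn application -list -appStates FINISHED,KILLED,FAILED

 # Fetch the aggregated container logs for that application
 # (application_1438237086809_0042 is a placeholder).
 yarn logs -applicationId application_1438237086809_0042 > app_logs.txt

 # Look for the messages that usually accompany a memory kill,
 # e.g. "is running beyond physical memory limits" / "Killing container".
 grep -iE "beyond (physical|virtual) memory|killing container|exit code" app_logs.txt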

3 Answers

39
votes

I had a very similar problem: many executors were being lost no matter how much memory we allocated to them.

The solution, if you're using YARN, was to set --conf spark.yarn.executor.memoryOverhead=600; alternatively, if your cluster uses Mesos, you can try --conf spark.mesos.executor.memoryOverhead=600 instead.

In Spark 2.3.1+ the configuration option is now --conf spark.executor.memoryOverhead=600

It seems we were not leaving sufficient memory for YARN itself, and containers were being killed because of it. After setting that we've had different out-of-memory errors, but not the same lost-executor problem.
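
Applied to the submit command from the question, the change amounts to adding one extra --conf flag. The sketch below keeps the asker's original values; only the overhead setting (600 MB, as in this answer) is new, and it uses the pre-2.3 YARN-specific key since the question is on Spark 1.x.

 ./spark-submit --class com.xyz.MySpark \
   --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
   --conf spark.yarn.executor.memoryOverhead=600 \
   --driver-java-options -XX:MaxPermSize=512m \
   --driver-memory 3g --master yarn-client \
   --executor-memory 2G --executor-cores 8 --num-executors 12 \
   /home/myuser/myspark-1.0.jar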

1
vote

You can follow this AWS post to calculate the memory overhead (and other Spark configs to tune): best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr
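
For reference, Spark's default overhead is roughly 10% of executor memory with a floor of 384 MB, which is also the rule of thumb such guides work from. The shell sketch below just applies that arithmetic to the 2 GB executors from the question; the variable names are illustrative, not Spark settings.

 # Rule of thumb: overhead = max(384 MB, 10% of executor memory).
 EXECUTOR_MEMORY_MB=2048                      # 2G executors, as in the question
 OVERHEAD_MB=$(( EXECUTOR_MEMORY_MB / 10 ))   # 204 MB
 if [ "$OVERHEAD_MB" -lt 384 ]; then OVERHEAD_MB=384; fi
 echo "Set --conf spark.yarn.executor.memoryOverhead=${OVERHEAD_MB}"
 # Each YARN container then needs executor memory + overhead,
 # i.e. about 2048 + 384 = 2432 MB here.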

-8
votes

I was also facing the same issue. For me, deleting logs and freeing up more HDFS space worked.