Spark standalone cluster tuning

Question

We have a Spark 2.1.0 standalone cluster running on a single node with 8 cores and 50 GB of memory (single worker).

We run Spark applications in cluster mode with the following memory settings:

--driver-memory = 7GB (default: 1 core is used)
--worker-memory = 43GB (all remaining cores: 7 cores)

Recently, we observed executors getting killed and restarted frequently by the driver/master. I found the logs below on the driver:

17/12/14 03:29:39 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 3658237 ms exceeds timeout 3600000 ms  
17/12/14 03:29:39 ERROR TaskSchedulerImpl: Lost executor 2 on 10.150.143.81: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 23.0 in stage 316.0 (TID 9449, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 9.0 in stage 318.0 (TID 9459, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 8.0 in stage 318.0 (TID 9458, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 5.0 in stage 318.0 (TID 9455, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 7.0 in stage 318.0 (TID 9457, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms

The application is not very memory intensive; there are a couple of joins and a dataset write to a directory. The same code runs in spark-shell without any failure.

I am looking for cluster tuning advice or configuration settings that will reduce how often executors get killed.

FurryMachine · Accepted Answer · 2018-05-25T08:17:36

First, I would advise never allocating a total of 50 GB of RAM to an application if your instance has exactly 50 GB of RAM. Other system processes need some RAM to work too, and RAM not used by applications is used by the operating system to cache files and reduce disk reads. The JVM also has some memory overhead outside the heap, so each process uses more than its configured memory.
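To make the headroom argument concrete, here is a minimal Python sketch of the arithmetic. The 6 GB operating-system reserve and the 10% off-heap JVM overhead are illustrative assumptions, not values from Spark or from the question, and `plan_executor_heap_gb` is a hypothetical helper name; adjust the numbers to what you actually observe on the machine.

```python
def plan_executor_heap_gb(total_gb, driver_heap_gb,
                          os_reserve_gb=6.0, jvm_overhead_fraction=0.10):
    """Return an executor heap size (GB) that leaves headroom on the node.

    os_reserve_gb and jvm_overhead_fraction are assumed, illustrative values.
    """
    if not 0 <= jvm_overhead_fraction < 1:
        raise ValueError("jvm_overhead_fraction must be in [0, 1)")
    # Memory left once the OS and file cache get their share.
    usable_gb = total_gb - os_reserve_gb
    # Each JVM needs heap * (1 + overhead), so the combined driver and
    # executor heaps must fit into usable / (1 + overhead).
    executor_heap_gb = usable_gb / (1 + jvm_overhead_fraction) - driver_heap_gb
    if executor_heap_gb <= 0:
        raise ValueError("no memory left for the executor under these assumptions")
    return executor_heap_gb


if __name__ == "__main__":
    # Node from the question: 50 GB total, 7 GB driver heap.
    print(round(plan_executor_heap_gb(50, 7), 1))
```

Under these assumed reserves, the 50 GB node with a 7 GB driver leaves roughly 33 GB for the executor heap, well below the 43 GB in the question.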

If your Spark job uses all the memory, your instance will inevitably swap, and once it swaps it will start to misbehave (for example, stalled processes can miss heartbeats). You can check memory usage and see whether the server is swapping by running htop. You should also reduce swappiness to 0, so that the system doesn't swap unless it really has to.
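If you prefer a scriptable check over htop, the same information can be read from the Linux /proc filesystem. This is a rough sketch that assumes a Linux host; `parse_meminfo` and `swap_used_kb` are hypothetical helper names, not part of any Spark tooling.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields):
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    swappiness = Path("/proc/sys/vm/swappiness")
    if meminfo.exists():
        fields = parse_meminfo(meminfo.read_text())
        print("MemAvailable (kB):", fields.get("MemAvailable"))
        print("Swap used (kB):", swap_used_kb(fields))
    if swappiness.exists():
        print("vm.swappiness:", swappiness.read_text().strip())
```

A non-zero and growing "Swap used" value while the job runs suggests the node is over-committed.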

That's all I can say given the information you provided. If this does not help, consider providing more information, such as the complete, exact parameters of your Spark job.
