
I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
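
A minimal sketch of the kind of job described above, assuming a standalone master at spark://master:7077; the cluster URL, app name, and the transformation itself are placeholders, not the real code:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master:7077")   # standalone master (assumed URL)
    .appName("heavy-flatmap-repro")
    .getOrCreate()
)
sc = spark.sparkContext

def heavy_transform(x):
    # stand-in for the CPU-heavy Python work; yields several records per input
    for i in range(1000):
        yield (x, i * i)

print(sc.parallelize(range(100000), numSlices=64).flatMap(heavy_transform).count())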

Here is what I am seeing:

  1. After 60 seconds without a heartbeat message from a worker, the master prints these messages to its log:

Removing worker [worker name] because we got no heartbeat in 60 seconds

Removing worker [worker name] on [IP]:[port]

Telling app of lost executor: [executor number]

  2. I then see the following message in the driver log:

Lost executor [executor number] on [executor IP]: worker lost

  3. The worker then terminates, and I see this message in its log:

Driver commanded a shutdown

I have looked at the Spark source code, and from what I can tell, as long as the executor is alive it should keep sending heartbeat messages back, since it uses a ThreadUtils.newDaemonSingleThreadScheduledExecutor to schedule them.

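A rough Python analogue of that scheduling pattern, only to illustrate the reasoning; Spark's real heartbeat code is Scala and this is not it:

import threading
import time

def start_heartbeat(send, interval_s=10.0):
    # Daemon thread that fires a heartbeat on a fixed schedule, roughly the
    # role newDaemonSingleThreadScheduledExecutor plays in the worker/executor.
    def loop():
        while True:
            send()
            time.sleep(interval_s)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# Heartbeats keep flowing as long as the process is alive and gets CPU time;
# if the whole process is paused or starved, nothing is sent and the master's
# 60-second timeout can fire even though the process never crashed.
start_heartbeat(lambda: print("heartbeat"), interval_s=1.0)
time.sleep(3)
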
One other thing I noticed while running top on one of the workers is that the executor JVM appears to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES environment variable, and each is consuming close to 100% of the CPU.

Anyone have any thoughts on this?

Forgot to add that if I increase spark.worker.timeout to something really large on the master, and spark.network.timeout when I submit the PySpark application, then the application succeeds with no problems. (Antonio Ye)
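
For reference, a sketch of where those two settings could live; the values and the application file name (my_app.py) are placeholders, not recommendations:

# e.g. in $SPARK_HOME/conf/spark-defaults.conf on the master, picked up when the master starts
spark.worker.timeout 600

# passed at submit time (or set in spark-defaults.conf on the submitting machine)
spark-submit --master spark://master:7077 --conf spark.network.timeout=600s my_app.py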

1 Answer


I was facing the same issue; increasing the timeouts and the heartbeat interval fixed it.

Excerpt from the master log after running start-all.sh:

INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.

Solution: add the following settings to $SPARK_HOME/conf/spark-defaults.conf:

spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
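
After editing spark-defaults.conf, restart the standalone daemons so the master- and worker-side settings are picked up, for example:

$SPARK_HOME/sbin/stop-all.sh
$SPARK_HOME/sbin/start-all.sh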