I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application on Spark 2.2.0 in standalone mode. I am doing a fairly CPU-heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
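For context, the job looks roughly like this (a minimal sketch; the paths and the body of heavy_transform are hypothetical stand-ins for my actual logic):

    from pyspark import SparkContext

    sc = SparkContext(appName="heavy-flatmap")

    def heavy_transform(line):
        # CPU-bound pure-Python work that yields zero or more outputs per
        # input record; placeholder for the real transformation
        value = line
        for _ in range(100000):
            value = hash(value)
        yield str(value)

    rdd = sc.textFile("hdfs:///data/input")   # hypothetical input path
    rdd.flatMap(heavy_transform).saveAsTextFile("hdfs:///data/output")

Each partition of the input runs this function in a separate Python worker process, which matches the CPU behavior I describe further down.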
Here is what I am seeing:
- After 60 seconds without a heartbeat message from a worker, the master prints these messages to its log (more on the timeout after the log excerpts below):
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
- I then see the following message in the driver log:
Lost executor [executor number] on [executor IP]: worker lost
- The worker then terminates and I see this message in its log:
Driver commanded a shutdown
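For what it's worth, the 60-second window in the master log lines up with spark.worker.timeout, which defaults to 60 in standalone mode. Raising it only treats the symptom, but it is a quick way to test whether the workers eventually come back. If I understand the config loading correctly, it would go in conf/spark-defaults.conf on the master host (the 300 is an arbitrary example):

    # conf/spark-defaults.conf on the master host
    spark.worker.timeout    300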
I have looked at the Spark source code, and from what I can tell, as long as the worker process (and likewise the executor) is alive it should keep sending heartbeats, since both schedule them on a ThreadUtils.newDaemonSingleThreadScheduledExecutor.
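To illustrate what I mean, the scheduling pattern in the source is roughly equivalent to this Python analog (illustrative only; the real code is Scala, and if I read it right the interval is a quarter of the 60s timeout, i.e. 15s):

    import threading

    HEARTBEAT_INTERVAL_S = 15.0  # spark.worker.timeout / 4, if I read the source right

    def send_heartbeat():
        # stand-in for the RPC message that tells the master this node is alive
        print("SendHeartbeat -> master")
        timer = threading.Timer(HEARTBEAT_INTERVAL_S, send_heartbeat)
        timer.daemon = True  # daemon thread, like newDaemonSingleThreadScheduledExecutor
        timer.start()

    send_heartbeat()

Since that thread is a daemon and does no heavy work itself, I would expect it to keep firing unless the whole JVM is stalled or stopped.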
One other thing I noticed while running top on one of the workers: the executor JVM appears to be suspended throughout this process. Meanwhile there are as many Python processes as I specified in the SPARK_WORKER_CORES environment variable, and each is consuming close to 100% of a CPU.
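One experiment I am considering (my own idea, not something the docs prescribe): cap the cores the application takes so the Python workers cannot saturate every core, leaving headroom for the worker and executor JVMs. In standalone mode, spark.cores.max limits the total cores an application grabs:

    from pyspark import SparkConf, SparkContext

    # Hypothetical experiment: request fewer cores than the machines have so
    # the CPU-bound Python processes leave headroom for the worker/executor JVMs.
    conf = (SparkConf()
            .setAppName("heavy-flatmap-throttled")
            .set("spark.cores.max", "6"))  # illustrative value for 8-core workers
    sc = SparkContext(conf=conf)

If the heartbeats stop getting lost once there is spare CPU, that would point at starvation rather than a Spark bug.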
Anyone have any thoughts on this?