I am running Spark SQL on a Hive table from the spark-shell.
The job runs successfully when the spark-shell is started with the following parameters:
"--driver-memory 8G --executor-memory 10G --executor-cores 1 --num-executors 30"
However, the job hangs when the spark-shell is started with:
"--driver-memory 8G --executor-memory 10G --executor-cores 1 --num-executors 40"
The only difference is the number of executors (30 vs. 40).
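For reference, the full invocation in the failing case looks roughly like the following (the master string is my paraphrase and the exact form depends on the Spark version; everything else is as above):

    spark-shell --master yarn \
      --driver-memory 8G --executor-memory 10G \
      --executor-cores 1 --num-executors 40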
In the second case I see one active task on each executor, but none of them make progress, and I see no "task completed" messages in the spark-shell output.
The job also runs successfully with any number of executors below 30.
My YARN cluster has 42 nodes, each with 30 cores and about 50 GB of memory.
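Unless my math is off, raw capacity should not be the limit. Assuming the default YARN memory overhead of roughly 7-10% of executor memory:

    requested:  40 executors x (10 GB + ~1 GB overhead) = ~440 GB and 40 cores, plus an 8 GB driver
    available:  42 nodes x 50 GB = ~2100 GB and 42 x 30 = 1260 cores

so the cluster should fit 40 executors comfortably.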
Any pointers on where I should look?
I compared the debug-level logs from both runs: the good runs contained many lines like the ones below, while the runs that appeared to hang had none of them.
"org.apache.spark.storage.BlockManager logDebug - Level for block broadcast_0_piece0 is StorageLevel(true, true, false, false, 1)" "org.apache.spark.storage.BlockManager logDebug - Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)"