Spark worker nodes timeout

Question

When I run my Spark app using sbt run with configuration pointing to master of a remote cluster nothing useful gets executed by the workers and the following warning is printed in sbt run log repeatedly.

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This is how my spark config looks like:

@transient lazy val conf: SparkConf = new SparkConf()
    .setMaster("spark://master-ip:7077")
    .setAppName("HelloWorld")
    .set("spark.executor.memory", "1g")
    .set("spark.driver.memory", "12g")

@transient lazy val sc: SparkContext = new SparkContext(conf)

val lines   = sc.textFile("hdfs://master-public-dns:9000/test/1000.csv")

I know this warning usually appears when the cluster is misconfigured and the workers either don't have the resources or aren't started in the first place. However, according to my Spark UI (on master-ip:8080) the worker nodes seem to be alive with sufficient RAM and cpu cores, they even try to execute my app but they exit and leave this in stderr log:

INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; 
users  with view permissions: Set(ubuntu, myuser); 
groups with view permissions: Set(); users  with modify permissions: Set(ubuntu, myuser); groups with modify permissions: Set()

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
... 8 more
ERROR RpcOutboxMessage: Ask timeout before connecting successfully

Any ideas?

Check UI @ master_url:8080 there should be some problem like there is no worker or resources are less — Akash Sethi

Hall Wong Hall Wong · Accepted Answer · 2018-05-31T05:59:15

Cannot receive any reply from 192.168.0.11:35996 in 120 seconds

Could you telnet to this port on this ip from worker, maybe your driver machine has multiple network interfaces, try to set SPARK_LOCAL_IP in $SPARK_HOME/conf/spark-env.sh

Spark worker nodes timeout

1 Answers