I am trying to submit a job (KMeans clustering in Python) to my Spark standalone cluster on EC2. The cluster has 18 nodes, and I am running the latest version of Spark (1.4.0).
I submit the job from the master using:
SPARK_WORKER_INSTANCES=30 SPARK_WORKER_CORES=4 SPARK_WORKER_MEMORY=30g SPARK_MEM=30g OUR_JAVA_MEM="30g" SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" ./spark/bin/spark-submit app.py --master spark://ec2-54-174-186-17.compute-1.amazonaws.com:7077 --executor-memory 500G --total-executor-cores 144
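For context, app.py does a standard MLlib KMeans run, roughly like the sketch below (the input path, k, and iteration count are simplified placeholders, not my exact values):

```python
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansApp")

# Load and parse space-separated feature vectors (placeholder path)
data = sc.textFile("data/kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Train KMeans with MLlib; k and maxIterations are placeholders
model = KMeans.train(parsed, k=10, maxIterations=20,
                     initializationMode="random")

print("Cluster centers: %s" % model.clusterCenters)
sc.stop()
```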
I see the following error in the worker logs:
15/06/17 21:10:01 INFO executor.Executor: Finished task 132.0 in stage 23.0 (TID 3444). 5802749 bytes result sent to driver
15/06/17 21:10:06 ERROR executor.CoarseGrainedExecutorBackend: Driver 172.31.23.236:41498 disassociated! Shutting down.
15/06/17 21:10:06 INFO storage.DiskBlockManager: Shutdown hook called
15/06/17 21:10:06 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:41498] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/06/17 21:10:06 INFO util.Utils: Shutdown hook called
Also, the master web UI shows the following:
> URL: spark://ec2-54-174-186-17.compute-1.amazonaws.com:7077
> REST URL: spark://ec2-54-174-186-17.compute-1.amazonaws.com:6066 (cluster mode)
> Workers: 18
> Cores: 144 Total, 144 Used
> Memory: 507.7 GB Total, 471.7 GB Used
> Applications: 1 Running, 8 Completed
> Drivers: 0 Running, 0 Completed
> Status: ALIVE
Looking around a bit, I read that this CoarseGrainedExecutorBackend error occurs when the executor is unable to communicate with the driver. I am able to access the Spark UI at http://ec2-54-174-186-17.compute-1.amazonaws.com:4040, but I am not sure whether the driver is actually running. Kindly let me know what I am doing wrong. Thanks a lot.
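In case it is relevant, a trivial job like the sketch below (run against the same master URL; not my actual app) would be one way to exercise driver–executor communication in isolation. I have not verified this yet:

```python
from pyspark import SparkContext

# Connect to the same standalone master as the real job
sc = SparkContext(master="spark://ec2-54-174-186-17.compute-1.amazonaws.com:7077",
                  appName="ConnectivityCheck")

# A tiny action: if this count comes back, driver and executors can talk
print(sc.parallelize(range(1000), 18).count())
sc.stop()
```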