Submitting jobs to Spark EC2 cluster remotely

Question

I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.

I'm trying to submit a sample job (SparkPi). When I ssh into the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both values of --deploy-mode:

--deploy-mode=client:

From my laptop:

./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

This results in the following warnings/errors, repeated indefinitely:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1

...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".

I've tried passing limits for cores and memory to the submit script, but it didn't help...

--deploy-mode=cluster:

From my laptop:

./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

The result is:

....
Driver successfully submitted as driver-20150223023734-0007
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)

So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.

UPDATE: For the second issue (cluster mode), the jar must be visible to every cluster node, so it has to be in a globally accessible location. This solves the IOException but leads to the same issue as in client mode.
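
For illustration, a cluster-mode submit with the jar at a globally visible URL might look like the sketch below. The HDFS path is a hypothetical placeholder (it assumes the jar was uploaded to an HDFS instance every node can read); a file:// path that exists on all nodes would be another option. The script only prints the command rather than running it.

```shell
#!/bin/sh
# Dry-run sketch: prints a cluster-mode spark-submit command.
# JAR_URL is a hypothetical location; it must be readable from every cluster node.
MASTER_URL="spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077"
JAR_URL="hdfs:///user/oleg/ec2test_2.10-0.0.1.jar"  # hypothetical path, not from the question

SUBMIT_CMD="./bin/spark-submit --master $MASTER_URL --deploy-mode cluster --class SparkPi $JAR_URL"

# Print the command so it can be inspected before running it by hand.
echo "$SUBMIT_CMD"
```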

I think the driver program serves code/stuff to the workers. Is your lappie reachable from your workers? Normally you need the driver program as close as possible to your cluster. — Alister Lee
As @AlisterLee said you should check the settings between your computer and the ec2 nodes (firewall, port settings, etc). If that fails, then you might want to try also taking this to the mailing list (and then reporting back the solution :)) — Justin Pihony

sgvd · Accepted Answer · 2015-04-15T16:49:23

The documentation at:

http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security

lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.

So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connections to those ports. Seeing as a bunch of the ports are chosen at random, this means opening up a wide range of ports, if not all of them. So using --deploy-mode=cluster, or client mode from within the cluster, is probably less painful.
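
If you do want to try client mode from the laptop, one way to narrow the set of ports is to pin the driver-side ports instead of letting Spark pick them at random. The sketch below is only an illustration: spark.driver.host, spark.driver.port and spark.blockManager.port are real Spark properties, but the host name and port numbers are hypothetical, and depending on your Spark version the linked documentation lists further ports that may also need fixing. The script prints the command instead of running it.

```shell
#!/bin/sh
# Dry-run sketch: prints a client-mode spark-submit command with fixed driver-side ports.
# DRIVER_HOST and both port numbers are hypothetical placeholders, not values from the question.
MASTER_URL="spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077"
APP_JAR="ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar"
DRIVER_HOST="laptop.example.com"  # hypothetical: an address the executors can reach
DRIVER_PORT=51000                 # hypothetical: forward/open this port to the laptop
BLOCK_MANAGER_PORT=51001          # hypothetical: forward/open this port as well

SUBMIT_CMD="./bin/spark-submit --master $MASTER_URL --deploy-mode client \
--conf spark.driver.host=$DRIVER_HOST \
--conf spark.driver.port=$DRIVER_PORT \
--conf spark.blockManager.port=$BLOCK_MANAGER_PORT \
--class SparkPi $APP_JAR"

# Print the command so it can be inspected before running it by hand.
echo "$SUBMIT_CMD"
```

The pinned ports are then the ones to forward on your router or allow through your firewall.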
