0
votes

Has anyone tried deploying Spark using https://console.developers.google.com/project/_/mc/template/hadoop?

Spark installed correctly for me. I can SSH into the Hadoop worker or master nodes, and Spark is installed at /home/hadoop/spark-install/

I can use the Spark Python shell to read a file from Cloud Storage:

lines = sc.textFile("hello.txt")

lines.count()

lines.first()

but I cannot successfully submit the Python example to the Spark cluster. When I run

bin/spark-submit --master spark://hadoop-m-XXX:7077 examples/src/main/python/pi.py 10

I always get

Traceback (most recent call last):
  File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/pi.py", line 38, in <module>
    count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
  File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I am pretty sure I am not connecting to the Spark cluster correctly. Has anyone successfully connected to a Spark cluster on Compute Engine?

1
Can you run simple programs on the Spark shell on your GCE cluster? – Nick Chammas
I can load from Google Cloud Storage and count lines. – Yuan Wang
Is Spark running on YARN? – Nick Chammas
I am using console.developers.google.com/project/_/mc/template/hadoop; it is running on a Hadoop cluster. – Yuan Wang
Not sure how clusters launched like that are configured, so I can't tell if it's a config issue or not. As an alternative, you can try this Spark-GCE launch script and see if you get a workable cluster. – Nick Chammas

1 Answer

1
votes

You can run jobs from the master:

ssh to the master node:

gcloud compute ssh --zone <zone> hadoop-m-<hash>

and then:

$ cd /home/hadoop/spark-install
$ spark-submit examples/src/main/python/pi.py 10

and somewhere in the output you should see something like:

Pi is roughly 3.140100
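
For context, pi.py estimates Pi by Monte Carlo sampling, so the value is only approximate and varies from run to run. A simplified sketch of what the bundled example does (Python 2 style to match Spark 1.1.0; variable names and the hard-coded partition count are illustrative, not the exact file):

# Count random points that land inside the unit circle and scale by 4.
from random import random
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonPi")
partitions = 10          # the "10" passed on the spark-submit command line
n = 100000 * partitions  # total number of samples

def inside(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(xrange(1, n + 1), partitions).map(inside).reduce(add)
print "Pi is roughly %f" % (4.0 * count / n)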

It looks like you are trying to submit jobs remotely. I'm not sure how you get that to work, but you can submit jobs from the master.
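
If you do want to drive this from your own machine, one workaround (just a sketch, assuming the gcloud copy-files command from that era; /tmp/pi.py is an illustrative path, and <zone> and <hash> are the same placeholders as above) is to copy the script to the master and run spark-submit there over SSH:

# Copy the local example up to the master node.
gcloud compute copy-files examples/src/main/python/pi.py hadoop-m-<hash>:/tmp/pi.py --zone <zone>

# Run spark-submit on the master itself, where the cluster configuration lives.
gcloud compute ssh hadoop-m-<hash> --zone <zone> \
  --command "cd /home/hadoop/spark-install && ./bin/spark-submit /tmp/pi.py 10"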

BTW, as a routine check, you can validate your Spark installation with:

cd /usr/local/share/google/bdutil-0.35.2/extensions/spark
sudo chmod 755 spark-validate-setup.sh
./spark-validate-setup.sh