Clarification about running spark jobs on a cluster (AWS)

Question

I have a HortonWorks cluster running on AWS EC2 machine on which I would like to run a spark job using spark streaming that will swallow the tweet concernings Game of thrones. Before trying to run it on my cluster I did run it locally. The code is working, here it is:

import org.apache.spark.streaming.{StreamingContext, Seconds}
import org.apache.spark.streaming.twitter._
import org.apache.spark.{SparkConf, SparkContext}

object Twitter_Stream extends App {

  val consumerKey = "hidden"
  val consumerSecret = "hidden"
  val accessToken = "hidden"
  val accessTokenSecret = "hidden"

  val sparkConf = new SparkConf().setAppName("GotTweets").setMaster("local[2]")

  val ssc = new StreamingContext(sparkConf, Seconds(1))

  val myStream = TwitterUtils.createStream(ssc, None, Array("#GoT","#WinterIsHere","#GameOfThrones"))

  val rddTweets = myStream.foreachRDD(rdd =>
  {
    rdd.take(10).foreach(println)
  })

  ssc.start()
  ssc.awaitTermination()
}

My question are more precisely about this specific code line :

val sparkConf = new SparkConf().setAppName("GotTweets").setMaster("local[2]")

I replaced the "local[2]" by "spark://ip-address-EC2:7077" wich correspond to one of my ec2 machine but I have a connection failure. I'm sure that the 7077 port is open on this machine.

Also when I run this code with this configuration (setMaster("local[2]")) on one of my EC2 machine , will my spark use all the machine of the cluster or will it run only on a single machine ?

Here the exception :

17/07/24 11:53:42 INFO AppClient$ClientEndpoint: Connecting to master spark://ip-adress:7077... 17/07/24 11:53:44 WARN AppClient$ClientEndpoint: Failed to connect to master ip-adress:7077 java.io.IOException: Failed to connect to spark://ip-adress:7077 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748)

To have a better understanding of the problem, could you please report the exception output? — Luis
I don't think the exception will bring more information than the ones I've said but I edited my message. — Robert Reynolds
Can you verify whether you have access to the machine via "telnet ip-address-EC2 7077" — Luis

Luis Luis · Accepted Answer · 2017-07-24T14:57:02

To run spark application using yarn, you should use spark-submit using --master yarn . No need to use setMaster inside scala source code.

Clarification about running spark jobs on a cluster (AWS)

1 Answers