
I have a Hadoop cluster with 4 nodes, and I created some Hive tables from files stored in HDFS. Then I configured MySQL as the Hive metastore and copied the hive-site.xml file into Spark's conf folder.
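For reference, a minimal hive-site.xml for a MySQL-backed metastore looks roughly like this (the host, database name, and credentials below are placeholders, not values from the question):

```xml
<configuration>
  <!-- JDBC connection to the MySQL database backing the Hive metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```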

To start the Hadoop cluster I ran start-dfs.sh and start-yarn.sh. Then I created the Hive tables, and now I'm executing some queries against them from Spark SQL using HiveContext, like:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val query = hiveContext.sql("select * from customers")
query.show()
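If the metastore is wired up correctly, you can sanity-check it from the same HiveContext by listing the tables Spark sees (a sketch assuming the running SparkContext sc and the hiveContext from the snippet above):

```scala
// List the Hive tables registered in the MySQL metastore
hiveContext.sql("show tables").show()

// Or inspect a specific table's schema before querying it
hiveContext.table("customers").printSchema()
```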

My doubt is: which cluster manager is Spark using in this case? Is it YARN, because I started YARN with the ./start-yarn.sh command? Or do I need to configure something for Spark to use YARN, and if I didn't, does it fall back to another cluster manager by default? And in your opinion, which cluster manager is better for this case, or is it indifferent?


1 Answer


Spark uses whichever cluster manager you specify with --master (local, standalone, YARN, etc.), and in YARN's case the --deploy-mode flag (client or cluster) controls where the driver runs, when you launch the application with spark-submit:

./bin/spark-submit \
  --class myclass \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  myapp.jar

Or you can specify the master in the code, like below (note that in Spark 2.x the "yarn-cluster" master string is deprecated in favor of setting the master to "yarn" and the deploy mode separately):

val conf = new SparkConf()
             .setMaster("yarn-cluster")
             .setAppName("myapp")
val sc = new SparkContext(conf)

If you are using spark-shell:

spark-shell --master yarn

If no master is specified anywhere (on the command line, in the code, or in conf/spark-defaults.conf), Spark falls back to local mode (local[*]). So simply starting YARN with start-yarn.sh is not enough; you have to tell Spark to use it.
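An easy way to confirm which cluster manager a running shell is actually using is to inspect the master URL on the live SparkContext (sc is the context that spark-shell creates for you):

```scala
// Prints the master URL the context was started with,
// e.g. "yarn" / "yarn-client" when running on YARN, or "local[*]" in local mode
println(sc.master)
```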