
I have a Hadoop cluster with 4 nodes. I created some Hive tables from files stored in HDFS, then configured MySQL as the Hive metastore and copied the hive-site.xml file into the conf folder of Spark.

To start the Hadoop cluster I started DFS and also YARN (start-yarn.sh). Then I created the Hive tables, and now I'm executing some queries against them from Spark SQL using HiveContext, like:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val query = hiveContext.sql("select * from customers")

My question is: which cluster manager is Spark using in this case? Is it YARN, because I started YARN with the ./start-yarn.sh command? Or do I need to configure something for it to use YARN, and if I didn't, does it use another cluster manager by default? And in your opinion, which cluster manager is better for this case, or does it make no difference?


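Spark runs in local mode or on a cluster manager depending on the --master you pass to spark-submit. For a cluster manager such as YARN, --deploy-mode (client or cluster) then decides where the driver runs.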

./bin/spark-submit \
  --class myclass \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  myapp.jar

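Or you can set the master in the code, like below:

import org.apache.spark.{SparkConf, SparkContext}
// "yarn-client" on Spark 1.x; plain "yarn" on Spark 2.x and later
val conf = new SparkConf().setAppName("myapp").setMaster("yarn-client")
val sc = new SparkContext(conf)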

If it is spark-shell, pass the same option:

spark-shell --master yarn

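By default, I believe it uses local mode: if no master is set on the command line, in the code, or as spark.master in conf/spark-defaults.conf, Spark runs locally. Starting YARN with start-yarn.sh does not by itself make Spark use it.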
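
To check which cluster manager a running session actually picked up, you can look at the master URL of the existing SparkContext. Below is a minimal sketch; clusterManager is a hypothetical helper written for this answer (not part of Spark), and it only pattern-matches on the usual master URL prefixes.

```scala
// Hypothetical helper (not a Spark API): maps a master URL string
// to the name of the cluster manager it refers to.
def clusterManager(master: String): String = master match {
  case m if m.startsWith("local")    => "local mode (no cluster manager)"
  case m if m.startsWith("yarn")     => "YARN"
  case m if m.startsWith("spark://") => "Spark standalone"
  case m if m.startsWith("mesos://") => "Mesos"
  case other                         => s"unknown: $other"
}

// In spark-shell, where sc already exists, you would call:
//   clusterManager(sc.master)
```

If spark-shell was started without --master and nothing else sets spark.master, sc.master should be a local[...] URL, which the helper reports as local mode.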