What's the difference between Spark using the Hive metastore and Spark running as the Hive execution engine? I have followed THIS TUTORIAL to configure Spark and Hive, and I have successfully created, populated and analysed data from a Hive table. Now what confuses me is: what exactly have I done?
a) Did I configure Spark to use the Hive metastore and analyse the data in the Hive table using Spark SQL?
b) Or did I actually use Spark as the Hive execution engine and analyse the data in the Hive table using HiveQL, which is what I want to do?
I will try to summarize what I have done to configure Spark and Hive:
a) I followed the above tutorial and configured Spark and Hive.
b) Wrote my /conf/hive-site.xml like this (the metastore-related part is sketched below, right after the session code), and
c) after that I wrote some code that connects to the Hive metastore and does my analysis. I am using Java for this, and this piece of code starts the Spark session:
SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL basic example")
        // makes Spark SQL use the Hive metastore and warehouse
        .enableHiveSupport()
        .config("spark.sql.warehouse.dir", "hdfs://saurab:9000/user/hive/warehouse")
        .config("mapred.input.dir.recursive", true)
        .config("hive.mapred.supports.subdirectories", true)
        .config("spark.sql.hive.thriftServer.singleSession", true)
        .master("local")
        .getOrCreate();
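For context, the metastore-related part of my hive-site.xml looks something along these lines (the host, user and password here are just placeholders, the actual file is the one linked above):

<!-- sketch of the metastore connection properties; values are placeholders -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://saurab:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>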
And this piece of code creates the database and table. Here db = "mydb" and table1 = "mytbl":
String query = "CREATE DATABASE IF NOT EXISTS " + db;
spark.sql(query);
String query = "CREATE EXTERNAL TABLE IF NOT EXISTS " + db + "." + table1
+ " (icode String, " +
"bill_date String, " +
"total_amount float, " +
"bill_no String, " +
"customer_code String) " +
"COMMENT \" Sales details \" " +
"ROW FORMAT DELIMITED FIELDS TERMINATED BY \",\" " +
"LINES TERMINATED BY \"\n\" " +
"STORED AS TEXTFILE " +
"LOCATION 'hdfs://saurab:9000/ekbana2/' " +
"tblproperties(\"skip.header.line.count\"=\"1\")";
spark.sql(query);
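After that, the analysis itself is just Spark SQL queries against that table. A simplified example of the kind of query I run (not my exact analysis code):

// query the Hive table through the same SparkSession; the result comes back
// as a DataFrame that I can show or process further
spark.sql("SELECT customer_code, SUM(total_amount) AS total_sales "
        + "FROM " + db + "." + table1 + " "
        + "GROUP BY customer_code")
     .show();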
Then I build the jar and run it using spark-submit:
./bin/spark-submit --master yarn \
  --jars jars/datanucleus-api-jdo-3.2.6.jar,jars/datanucleus-core-3.2.10.jar,jars/datanucleus-rdbms-3.2.9.jar,/home/saurab/hadoopec/hive/lib/mysql-connector-java-5.1.38.jar \
  --verbose \
  --properties-file /home/saurab/hadoopec/spark/conf/spark-env.sh \
  --files /home/saurab/hadoopec/spark/conf/hive-site.xml \
  --class HiveRead \
  /home/saurab/sparkProjects/spark_hive/target/myJar-jar-with-dependencies.jar
Doing this I get the results I want, but I am not sure whether I am actually doing what I want to do. My question might be hard to follow because I don't know how to explain it well; if so, please comment and I will try to expand it.
Also, if there is any tutorial that focuses on how Spark and Hive work together, please share a link. I also want to know whether Spark reads spark/conf/hive-site.xml or hive/conf/hive-site.xml, because I am confused about where to set hive.execution.engine=spark.
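For reference, the property I mean is this one; I just don't know which hive-site.xml it belongs in:

<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>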
Thanks.