Querying on multiple Hive stores using Apache Spark

Question

I have a spark application which will successfully connect to hive and query on hive tables using spark engine.

To build this, I just added hive-site.xml to classpath of the application and spark will read the hive-site.xml to connect to its metastore. This method was suggested in spark's mailing list.

So far so good. Now I want to connect to two hive stores and I don't think adding another hive-site.xml to my classpath will be helpful. I referred quite a few articles and spark mailing lists but could not find anyone doing this.

Can someone suggest how I can achieve this?

Thanks.

Docs referred:

Just an idea of a workaround: Hive data reside on HDFS anyway, you can create a DataFrame over a file or dir? I mean, sc.wholeTextFiles('hdfs://host/usr/hive/warehouse/mytable') will give you contents of Hive table. Sure, you will lose comfort of meta-data, but it might work out. — mehmetminanc
I've asked myself the same question, searched a lot, and read some of Spark's code - so far I'm almost convinced this is impossible, at least not in the same SparkContext :( — Tzach Zohar
@karthik-manchala I tried to setup Spark-Hive and ran it HortonWorks Sandbox but i am getting some errors, can you help me out in this. — Sanjay Verma
the code base is located at: github.com/ersanjayverma/spark-hive-solr-demo. I have posted a question at hortonworks community if want to have a look. community.hortonworks.com/questions/65324/… — Sanjay Verma

Aditya Aditya · Accepted Answer · 2016-08-07T09:45:16

I think this is possible by making use of Spark SQL capability of connecting and reading data from remote databases using JDBC.

After an exhaustive R & D, I was successfully able to connect to two different hive environments using JDBC and load the hive tables as DataFrames into Spark for further processing.

Environment details

hadoop-2.6.0

apache-hive-2.0.0-bin

spark-1.3.1-bin-hadoop2.6

Code Sample HiveMultiEnvironment.scala

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
object HiveMultiEnvironment {
  def main(args: Array[String]) {
    var conf = new SparkConf().setAppName("JDBC").setMaster("local")
    var sc = new SparkContext(conf)
    var sqlContext = new SQLContext(sc)

    // load hive table (or) sub-query from Environment 1

    val jdbcDF1 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host1>:10000/<db>",
      "dbtable" -> "<db.tablename or subquery>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))
    jdbcDF1.foreach { println }
      
    // load hive table (or) sub-query from Environment 2

    val jdbcDF2 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host2>:10000/<db>",
      "dbtable" -> "<db.tablename> or <subquery>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))
    jdbcDF2.foreach { println }
  }
  // todo: business logic
}

Other parameters can also be set during load using SqlContext such as setting partitionColumn. Details found under 'JDBC To Other Databases' section in Spark reference doc: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html

Build path from Eclipse:

What I Haven't Tried

Use of HiveContext for Environment 1 and SqlContext for environment 2

Hope this will be useful.

Querying on multiple Hive stores using Apache Spark

2 Answers