3 votes

I have a Spark cluster set up and have tried both native Scala and Spark SQL on my dataset, and the setup seems to work for the most part. I have the following questions:

For ODBC/external connectivity to the cluster, what should I expect? Does the admin/developer shape the data and persist/cache a few RDDs that will be exposed (thinking along the lines of Hive tables)? What would be the equivalent of connecting to a "Hive metastore" in Spark/Spark SQL?

Is thinking along the lines of Hive flawed?

My other question is: when I issue Hive queries (say, creating tables and such), they use the same Hive metastore as Hadoop/Hive. Where do the tables get created when I issue SQL queries using SQLContext? And if I persist a table, is that the same concept as persisting an RDD?

Appreciate your answers.

Nithya


1 Answer

5 votes

(This is written with Spark 1.1 in mind; be aware that new features tend to be added quickly, and some of the limitations mentioned below may well disappear in the future.)

You can use Spark SQL with Hive syntax and connect to the Hive metastore, which results in your Spark SQL Hive commands being executed on the same data space as if they were executed through Hive directly.

To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
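As a minimal sketch in Scala (assuming an existing SparkContext and a hive-site.xml on the classpath that points at your metastore; the table name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf().setAppName("SparkSQLOnHive")
    val sc = new SparkContext(conf)

    // Picks up hive-site.xml from the classpath to locate the Hive metastore
    val hiveContext = new HiveContext(sc)

    // HiveQL runs against the same metastore and warehouse as plain Hive
    val result = hiveContext.sql("SELECT key, value FROM my_hive_table LIMIT 10")
    result.collect().foreach(println)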

The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects with an associated schema. You can use it just like any other RDD, including cache and persist, and the effect is the same (the fact that the data comes from Hive has no influence here).
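For example (a sketch reusing the hiveContext above; the table and columns are hypothetical):

    // SELECT returns a SchemaRDD: an RDD[Row] plus a schema
    val people = hiveContext.sql("SELECT name, age FROM people")

    // Standard RDD operations and caching apply
    people.cache()   // or people.persist(...) with an explicit StorageLevel
    val adults = people.filter(row => row.getInt(1) >= 18)
    println(adults.count())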

If your Hive command creates data, e.g. "CREATE TABLE ...", the corresponding table is created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
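For instance (a sketch; the table name and input path are only illustrative, and the warehouse location comes from your hive-site.xml):

    // Creates a table in the Hive warehouse (e.g. /var/lib/hive/warehouse/src by default)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")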

Executing Hive SQL through Spark gives you all the caching benefits of Spark: executing a second SQL query on the same data set within the same Spark context will typically be much faster than the first query.
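You can also pin a table in memory explicitly; a rough sketch (reusing the hypothetical "src" table above):

    // Cache the table in memory; subsequent queries against it read from the cache
    hiveContext.cacheTable("src")
    hiveContext.sql("SELECT COUNT(*) FROM src").collect()   // first run populates the cache
    hiveContext.sql("SELECT COUNT(*) FROM src").collect()   // typically much faster
    hiveContext.uncacheTable("src")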

Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially equivalent to HiveServer2 and thus lets you execute Spark SQL commands over a JDBC connection.
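A rough sketch of connecting once the Thrift server is running (host, port, database and driver class here are assumptions based on the HiveServer2 defaults; the query reuses the hypothetical "src" table):

    import java.sql.DriverManager

    // The Thrift JDBC server speaks the HiveServer2 protocol, so the Hive JDBC driver works
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM src")
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()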

Note that not all Hive features are available (yet?), see details here.

Finally, you can also discard the Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available this way is smaller than when using the Hive syntax.
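For example, a sketch of querying a Parquet file with a plain SQLContext, no Hive involved (the file path, table name and column names are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load a Parquet file as a SchemaRDD and expose it to SQL under a temporary name
    val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
    events.registerTempTable("events")

    sqlContext.sql("SELECT eventType, COUNT(*) FROM events GROUP BY eventType")
      .collect()
      .foreach(println)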