
I am working with HDP 2.6.4; to be more specific, Hive 1.2.1 with Tez 0.7.0, and Spark 2.2.0.

My task is simple: store data in the ORC file format, then use Spark to process it. To achieve this, I do the following:

  1. Create a Hive table through HiveQL
  2. Use spark.sql("select ... from ...") to load the data into a DataFrame
  3. Process the DataFrame

My questions are: 1. What is Hive's role behind the scenes? 2. Is it possible to skip Hive?


1 Answer


You can skip Hive and use Spark SQL to run the DDL command from step 1.

In your case, Hive is defining a schema over your data and providing a query layer through which Spark and external clients can communicate with it.

Otherwise, spark.read.orc and df.write.orc exist for reading and writing DataFrames directly on the filesystem, with no metastore involved.