1 vote
CREATE TABLE employee_details (
  emp_first_name varchar(50),
  emp_last_name  varchar(50),
  emp_dept       varchar(50)
)
PARTITIONED BY (
  emp_doj     varchar(50),
  emp_dept_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

The Hive table is stored at /data/warehouse/employee_details.

I have a Hive table, employee_details, loaded with data; it is partitioned by emp_doj and emp_dept_id, and its file format is RCFile.

I would like to process the data in this table using Spark SQL without using HiveContext (simply using SQLContext).

Could you please help me understand how to load the partitioned data of this Hive table into an RDD and convert it to a DataFrame?

You can use sqlContext.sql("select * from employee_details"). - Shankar
What version of Spark are you using? - Shankar

1 Answer

0 votes

If you are using Spark 2.0, you can do it this way:

import org.apache.spark.sql.SparkSession

// Warehouse directory taken from the table location given in the question
// (/data/warehouse/employee_details sits under /data/warehouse).
val warehouseLocation = "/data/warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Queries are expressed in HiveQL; results come back as DataFrames.
sql("SELECT * FROM employee_details").show()