Load an RDD into Hive

Question

I want to load an RDD (k=table_name, v=content) into a partitioned Hive table (year, month, day) with PySpark on Spark 1.6.x.

I am trying to reproduce the logic of this SQL query:

ALTER TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% DROP IF EXISTS PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);
LOAD DATA INTO TABLE db_schema.%FILENAME_WITHOUT_EXTENSION% PARTITION (year=%YEAR%, month=%MONTH%, day=%DAY%);

Could someone please give some suggestions?

Zhang Tong · Accepted Answer · 2017-01-10T08:36:35

from pyspark.sql import SparkSession  # SparkSession needs Spark 2.0+

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
rdd = spark.sparkContext.parallelize([(1, 'cat', '2016-12-20'), (2, 'dog', '2016-12-21')])
df = spark.createDataFrame(rdd, schema=['id', 'val', 'dt'])
# Save as an ORC-backed Hive table, partitioned by the dt column
df.write.saveAsTable(name='default.test', format='orc', mode='overwrite', partitionBy='dt')

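Use enableHiveSupport() when building the session and df.write.saveAsTable() to write the partitioned table. SparkSession only exists from Spark 2.0 onward; on 1.6.x, create a HiveContext from your SparkContext and call createDataFrame() on that instead.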
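
If you would rather keep the DROP PARTITION / LOAD DATA logic from the question, here is a minimal sketch of how that could look. It assumes hive_ctx is a HiveContext (Spark 1.6.x) whose sql() method accepts these HiveQL statements. The function names, the source_path argument and the INPATH clause are my own assumptions: the template in the question does not say where the data lives, so the RDD content would first have to be written out to some path (for example with saveAsTextFile).

```python
def partition_statements(table, year, month, day, source_path):
    """Build the two HiveQL statements for one table and one day.

    source_path is a hypothetical location of the data to load;
    it is not part of the template in the question.
    """
    spec = "year={0:d}, month={1:d}, day={2:d}".format(year, month, day)
    drop = "ALTER TABLE db_schema.{0} DROP IF EXISTS PARTITION ({1})".format(
        table, spec)
    load = "LOAD DATA INPATH '{0}' INTO TABLE db_schema.{1} PARTITION ({2})".format(
        source_path, table, spec)
    return [drop, load]


def reload_partition(hive_ctx, table, year, month, day, source_path):
    """Run the statements one at a time; sql() takes one statement per call."""
    for stmt in partition_statements(table, year, month, day, source_path):
        hive_ctx.sql(stmt)
```

With a real context this would be called roughly as reload_partition(HiveContext(sc), 'my_table', 2016, 12, 20, '/tmp/staging/my_table'), where the table name and path are placeholders. Splitting the statements matters because the semicolon-separated form from the question cannot be passed to sql() in one go.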

