Loading a spark dataframe into Hive partition

Question

I'm trying to load a dataframe into a Hive table that is partitioned as shown below.

> create table emptab(id int, name String, salary int, dept String)
> partitioned by (location String)
> row format delimited
> fields terminated by ','
> stored as parquet;

I have a dataframe created in the below format:

val empfile = sc.textFile("emp")
val empdata = empfile.map(e => e.split(","))
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empdata.map(e => employee(e(0).toInt, e(1), e(2).toInt, e(3)))
val empDF = empRDD.toDF()
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=England")

But I'm getting the error below:

empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=England")
java.lang.RuntimeException: [1.1] failure: identifier expected
/user/hive/warehouse/emptab/location=England

Data in "emp" file:

+---+-------+------+-----+
| id|   name|salary| dept|
+---+-------+------+-----+
|  1|   Mark|  1000|   HR|
|  2|  Peter|  1200|SALES|
|  3|  Henry|  1500|   HR|
|  4|   Adam|  2000|   IT|
|  5|  Steve|  2500|   IT|
|  6|  Brian|  2700|   IT|
|  7|Michael|  3000|   HR|
|  8|  Steve| 10000|SALES|
|  9|  Peter|  7000|   HR|
| 10|    Dan|  6000|   BS|
+---+-------+------+-----+

Also, this is the first load into the Hive table, which is partitioned and currently empty. I am trying to create a partition while loading the data into the table. Could anyone tell me what I am doing wrong here and how I can correct it?

Thiago Baldim · Accepted Answer · 2017-06-21T12:37:42

This is a wrong approach.

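insertInto expects a table name, not a Hadoop path, so the partition directory you are passing is not a valid argument. That is why the parser reports "identifier expected".

What you have to do is:

val empDF = empRDD.toDF()
// Assumes empDF has a location column; Column equality is === in Scala
val empDFFiltered = empDF.filter(empDF("location") === "India")
// Pass the table name; the partition directory is resolved from the location value
empDFFiltered.write.partitionBy("location").insertInto("emptab")

The partition directory is handled by the partitioning on location. If you only want to add data to the India partition, filter the India rows out of your dataframe first. On Spark 2.x and later, drop the partitionBy call, because insertInto refuses to be combined with it when the table already defines its partition columns.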
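
For reference, here is a minimal end-to-end sketch of the load. It assumes Spark 2.x or later with Hive support, that the emptab table from the question already exists, and that every row in the "emp" file belongs to one location. The "India" value, the application name and the object name LoadEmpPartition are illustrative assumptions, not something taken from the question. Since the file has no location column, the sketch adds one, and it calls insertInto with the table name and without partitionBy.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Same fields as the case class in the question, renamed to UpperCamelCase
case class Employee(id: Int, name: String, salary: Int, dept: String)

object LoadEmpPartition {
  // Parse one comma-separated line of the "emp" file: id,name,salary,dept
  def parse(line: String): Employee = {
    val f = line.split(",").map(_.trim)
    Employee(f(0).toInt, f(1), f(2).toInt, f(3))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-emptab") // hypothetical application name
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hive rejects inserts where every partition value is dynamic
    // unless dynamic partitioning is enabled in nonstrict mode
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    val empDF = spark.sparkContext.textFile("emp").map(parse).toDF()

    // The file carries no location, so add it as a constant (assumed value).
    // insertInto resolves columns by position, so the partition column goes last.
    val withLocation = empDF
      .withColumn("location", lit("India"))
      .select("id", "name", "salary", "dept", "location")

    // Table name, not a warehouse path; the table already defines its partitioning
    withLocation.write.mode("append").insertInto("emptab")

    spark.stop()
  }
}
```

If the rows belong to several locations, the same write creates one partition per distinct value of the location column, so filtering is only needed when you want to load a single partition.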
