I have written the following Scala program in Eclipse to read a CSV file from a location in HDFS and then save that data into a Hive table [I am using the HDP 2.4 sandbox running in VMware on my local machine]:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object HDFS2HiveFileRead {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("HDFS2HiveFileRead")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    println("loading data")
    val loadDF = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("hdfs://192.168.159.129:8020/employee.csv")
    println("data loaded")
    loadDF.printSchema()

    println("creating table")
    loadDF.write.saveAsTable("%s.%s".format("default", "tblEmployee2"))
    println("table created")

    val selectQuery = "SELECT * FROM default.tblEmployee2"
    println("selecting data")
    val result = hiveContext.sql(selectQuery)
    result.show()
  }
}
When I run this program from Eclipse using the
Run As -> Scala Application
option, it shows the following results on the Eclipse console:
loading data
data loaded
root
|-- empid: string (nullable = true)
|-- empname: string (nullable = true)
|-- empage: string (nullable = true)
creating table
17/06/29 13:27:08 INFO CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema: { "type" : "struct", "fields" : [ { "name" : "empid", "type" : "string", "nullable" : true, "metadata" : { } }, { "name" : "empname", "type" : "string", "nullable" : true, "metadata" : { } }, { "name" : "empage", "type" : "string", "nullable" : true, "metadata" : { } } ] } and corresponding Parquet message type: message spark_schema { optional binary empid (UTF8); optional binary empname (UTF8); optional binary empage (UTF8); }
table created
selecting data
+-----+--------+------+
|empid| empname|empage|
+-----+--------+------+
| 1201| satish| 25|
| 1202| krishna| 28|
| 1203| amith| 39|
| 1204| javed| 23|
| 1205| prudvi| 23|
+-----+--------+------+
17/06/29 13:27:14 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10 java.io.IOException: Failed to delete: C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
This shows that the CSV data has been loaded from the desired HDFS location [in HDP] and that a table named tblEmployee2 has also been created in Hive, as I can read and see the results in the console. I can even read this table again and again by running any Spark job against it.
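Just to illustrate what I mean by reading it again, a separate minimal job like the following (the object name is made up; same local setup as the main program above) reads the data back without any problem:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object ReadTblEmployee2 {
  def main(args: Array[String]) {
    // Same local setup as the main program above; this finds the table and prints its rows.
    val sc = new SparkContext(new SparkConf().setAppName("ReadTblEmployee2").setMaster("local"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT * FROM default.tblEmployee2").show()
  }
}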
BUT the issue is that as soon as I log in to my HDP 2.4 sandbox through PuTTY and try to see this table in Hive:
1) I cannot see this table there.
2) I expect this code to create a managed/internal table in Hive, so the CSV file at the given HDFS location should also get moved from its original location to the Hive warehouse/metastore location, but that is not happening.
3) I can also see a metastore_db folder getting created in my Eclipse workspace; does that mean this tblEmployee2 is getting created on my local/Windows machine?
4) How can I resolve this issue and make my code create the Hive table in HDP? Is there any configuration that I am missing here? (See the sketch after this list for what I suspect.)
5) Why am I getting the last error in my execution? Any quick response/pointer would be appreciated.
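My current guess for 3) and 4) is that, since I give no metastore configuration, the HiveContext falls back to a local embedded Derby metastore (hence the metastore_db folder in Eclipse), and that it needs to be pointed at the sandbox's Hive metastore explicitly. A minimal sketch of what I mean, using the same imports and setup as the program above (the host is my sandbox's IP and 9083 is, I assume, the default metastore thrift port):

val conf = new SparkConf().setAppName("HDFS2HiveFileRead").setMaster("local")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
// Point the HiveContext at the sandbox's metastore instead of the local embedded
// Derby one (host and port are assumptions based on my setup, not verified values).
hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")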
UPDATE: After thinking about it a lot, I added
hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")
The code moved forward a bit, but some permission-related issues started appearing. I can now see this table [tblEmployee2] in Hive's default database in my VMware sandbox, but Spark persists it in its own Spark SQL specific format:
17/06/29 22:43:21 WARN HiveContext$$anon$2: Could not persist `default`.`tblEmployee2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
Hence, I am still not able to use HiveContext the way I want, and my above-mentioned issues 2-5 still persist.
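For completeness, this is the kind of Hive-compatible write I am trying to achieve instead of saveAsTable. It is only a sketch I have not verified; it reuses loadDF and hiveContext from the program above, and the staging table name is made up:

// Untested idea: register the DataFrame as a temporary table and let HiveQL
// create the target table, so the data ends up in a format Hive itself can read.
// ORC is just my choice of storage format here.
loadDF.registerTempTable("employee_staging")
hiveContext.sql("CREATE TABLE default.tblEmployee2 STORED AS ORC AS SELECT * FROM employee_staging")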
Regards, Bhupesh