
I have written the following Scala program in Eclipse to read a CSV file from a location in HDFS and then save that data into a Hive table [I am using the HDP 2.4 sandbox running in VMware on my local machine]:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object HDFS2HiveFileRead {
  def main(args: Array[String]) {
    val conf = new SparkConf()
                    .setAppName("HDFS2HiveFileRead")
                    .setMaster("local")

    val sc = new SparkContext(conf)

    val hiveContext = new HiveContext(sc)

    println("loading data") 
    val loadDF = hiveContext.read
                            .format("com.databricks.spark.csv")
                            .option("header","true")
                            .option("delimiter",",")
                            .load("hdfs://192.168.159.129:8020/employee.csv")
    println("data loaded") 
    loadDF.printSchema()

    println("creating table")
    loadDF.write.saveAsTable("%s.%s".format( "default" , "tblEmployee2" ))
    println("table created")

    val selectQuery = "SELECT * FROM default.tblEmployee2"

    println("selecting data")
    val result = hiveContext.sql(selectQuery)

    result.show()
  }
}
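
For reference, the employee.csv at that HDFS location has a header row and comma-separated rows like the following (these are just the rows visible in the output further below, not necessarily the whole file):

empid,empname,empage
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23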

When I run this program from Eclipse using

Run As -> Scala Application

it shows the following results on the Eclipse console:

loading data

data loaded

root
 |-- empid: string (nullable = true)
 |-- empname: string (nullable = true)
 |-- empage: string (nullable = true)

creating table

17/06/29 13:27:08 INFO CatalystWriteSupport: Initialized Parquet WriteSupport with Catalyst schema: { "type" : "struct", "fields" : [ { "name" : "empid", "type" : "string", "nullable" : true, "metadata" : { } }, { "name" : "empname", "type" : "string", "nullable" : true, "metadata" : { } }, { "name" : "empage", "type" : "string", "nullable" : true, "metadata" : { } } ] } and corresponding Parquet message type: message spark_schema { optional binary empid (UTF8); optional binary empname (UTF8); optional binary empage (UTF8); }

table created

selecting data

+-----+--------+------+
|empid| empname|empage|
+-----+--------+------+
| 1201|  satish|    25|
| 1202| krishna|    28|
| 1203|   amith|    39|
| 1204|   javed|    23|
| 1205|  prudvi|    23|
+-----+--------+------+

17/06/29 13:27:14 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10 java.io.IOException: Failed to delete: C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10

This shows that the CSV data has been loaded from the desired HDFS location [in HDP] and that a table named tblEmployee2 has been created in Hive, since I can read it and see the results in the console. I can even read this table again and again from any Spark job.
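
For example, a separate job along these lines reads it back without any problem (just a minimal sketch of what I mean; the object name ReadTblEmployee2 is only for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object ReadTblEmployee2 {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("ReadTblEmployee2").setMaster("local"))
    val hiveContext = new HiveContext(sc)

    // Reading the table back works fine from Eclipse, which tells me
    // the table does exist somewhere that my Spark jobs can see.
    hiveContext.sql("SELECT * FROM default.tblEmployee2").show()
  }
}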

BUT, the issue is that as soon as I log into my HDP 2.4 sandbox through PuTTY and try to look at this table in Hive:

1) I cannot see the table there.

2) I assumed this code would create a managed/internal table in Hive, so the CSV file at the given HDFS location should also be moved from its original location into the Hive warehouse location, but that is not happening. Why?

3) I can also see a metastore_db folder getting created in my Eclipse project. Does that mean this tblEmployee2 is being created on my local/Windows machine?

4) How can I resolve this and have my code create the Hive table in HDP? Is there some configuration I am missing here?

5) Why am I getting that last error during execution? Any quick response/pointer would be appreciated.

UPDATE: After thinking about it a lot, I added

hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")

The code moved a bit further, though some permission-related issues started appearing. I can now see this table [tblEmployee2] in Hive's default database on my VMware sandbox, but Spark SQL persists it in its own format by itself:

17/06/29 22:43:21 WARN HiveContext$$anon$2: Could not persist `default`.`tblEmployee2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
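
For reference, the only change to the program above is this line right after creating the HiveContext (the IP and port are my sandbox's metastore host and port):

val hiveContext = new HiveContext(sc)

// Point Spark at the Hive metastore running in the HDP sandbox, instead of
// the embedded Derby metastore (the metastore_db folder created next to Eclipse).
hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")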

Hence, I am still not able to use HiveContext properly, and my above-mentioned issues 2-5 still persist.

Regards, Bhupesh


1 Answer


You are running Spark in local mode:

val conf = new SparkConf()
    .setAppName("HDFS2HiveFileRead")
    .setMaster("local")

In local mode, when you call saveAsTable, Spark tries to create the table on your local machine, using the embedded metastore (the metastore_db folder you see next to Eclipse), not in the cluster's Hive. Change your configuration to run in YARN mode.
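
A minimal sketch of what that could look like, assuming the metastore URI from your question and that the cluster's Hadoop/YARN configuration files are on the classpath (e.g. via HADOOP_CONF_DIR):

val conf = new SparkConf()
    .setAppName("HDFS2HiveFileRead")
    .setMaster("yarn-client")   // run against the cluster instead of locally

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Use the sandbox's Hive metastore so the table lands in HDP's Hive,
// not in a local Derby metastore_db.
hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")

In practice it is often simpler to package the job as a jar and run it on the sandbox itself with spark-submit --master yarn-client, so that hive-site.xml and the Hadoop configuration are picked up automatically.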

You can refer to the below URL for details: http://www.coding-daddy.xyz/node/7