0 votes

I am using the Spark DataFrame writer to write data to internal Hive tables in parquet format on IBM Cloud Object Storage (COS). My Hive metastore is on an HDP cluster, and I am running the Spark job from that cluster; the job writes the data to IBM COS in parquet format. This is how I start the Spark session:

SparkSession session = SparkSession.builder()
        .appName("ParquetReadWrite")
        .config("hive.metastore.uris", "<thrift_url>")
        .config("spark.sql.sources.bucketing.enabled", true)
        .enableHiveSupport()
        .master("yarn")
        .getOrCreate();

// IBM COS credentials and endpoint for the "mpcos" service
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key", credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id", credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint", credentials.get(ConnectionConstants.COS_ENDPOINT));

The issue I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:

spark.sql("select * from partitioned_table").show

To fetch the data from the partitioned table, I have to load the dataframe and register it as a temp table and then query it (a sketch of that workaround is shown after the write code below). The above issue does not occur when the table is not partitioned. The code that writes the data is this:

dfWithSchema.orderBy(sortKey).write()
        .partitionBy("somekey")
        .mode("append")
        .format("parquet")
        .option("path", PARQUET_PATH + tableName)
        .saveAsTable(tableName);
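
For reference, here is roughly the temp-view workaround mentioned above. This is only a sketch: it reuses session, PARQUET_PATH and tableName from the snippets above, and the view name is just a placeholder.

// Load the parquet files directly from the table path, register a temp view,
// and query the view instead of the Hive table.
Dataset<Row> partitioned = session.read()
        .format("parquet")
        .load(PARQUET_PATH + tableName);
partitioned.createOrReplaceTempView("partitioned_table_tmp");
session.sql("select * from partitioned_table_tmp").show();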

Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?


2 Answers

0 votes

To read the partitioned table (created by Spark), you need to give the absolute path of the table, as below.

selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")

To filter it further, please try the below approach.

from pyspark.sql.functions import col
selected_Data.where(col("column_name") == 'col_value').show()
0 votes

This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query filters on a partition column that is not of STRING type.

For Spark, it is recommended to create tables with partition columns of STRING type.
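
For example, one way to follow that recommendation is to cast the partition column to string before writing. The snippet below is only a sketch reusing the Java writer code and names (dfWithSchema, somekey, PARQUET_PATH, tableName) from the question; an existing table with a non-string partition column would have to be dropped and recreated first.

// Cast the partition column to STRING before saving, so the metastore
// registers a string-typed partition key that Spark SQL can filter directly.
dfWithSchema
        .withColumn("somekey", dfWithSchema.col("somekey").cast("string"))
        .write()
        .partitionBy("somekey")
        .mode("append")
        .format("parquet")
        .option("path", PARQUET_PATH + tableName)
        .saveAsTable(tableName);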

If you are getting the below error message while filtering the Hive partitioned table in Spark:

Caused by: MetaException(message:Filtering is supported only on partition keys of type string)

recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.

Otherwise, you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:

selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
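
Either way, it helps to confirm how the partition column is actually defined in the metastore. A quick check, sketched here with the Java session from the question (use your own table name), is:

// The output includes a "# Partition Information" section that shows the
// partition column and its data type (string vs varchar).
session.sql("DESCRIBE TABLE " + tableName).show(100, false);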

However, I was not able to understand why it differentiates between the varchar and string datatypes for the partition column.