0 votes

I am using the Spark DataFrame writer to write data to internal Hive tables in parquet format on IBM Cloud Object Storage (COS). My Hive metastore is on an HDP cluster, and I am running the Spark job from that cluster; the job writes the data to IBM COS in parquet format. This is how I start the Spark session:

SparkSession session = SparkSession.builder()
        .appName("ParquetReadWrite")
        .config("hive.metastore.uris", "<thrift_url>")
        .config("spark.sql.sources.bucketing.enabled", true)
        .enableHiveSupport()
        .master("yarn")
        .getOrCreate();

// IBM COS credentials and endpoint for the "mpcos" service
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key", credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id", credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint", credentials.get(ConnectionConstants.COS_ENDPOINT));

The issue I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:

spark.sql("select * from partitioned_table").show

To fetch the data from the partitioned table, I have to load the dataframe and register it as a temp table and then query it (a sketch of that workaround is shown after the write code below). The above issue does not occur when the table is not partitioned. The code that writes the data is this:

dfWithSchema.orderBy(sortKey).write()
        .partitionBy("somekey")
        .mode("append")
        .format("parquet")
        .option("path", PARQUET_PATH + tableName)
        .saveAsTable(tableName);
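
For reference, here is roughly the temp-view workaround mentioned above. This is only a sketch: it reuses session, PARQUET_PATH and tableName from the snippets above, and the view name is just a placeholder.

// Load the parquet files directly from the table path, register a temp view,
// and query the view instead of the Hive table.
Dataset<Row> partitioned = session.read()
        .format("parquet")
        .load(PARQUET_PATH + tableName);
partitioned.createOrReplaceTempView("partitioned_table_tmp");
session.sql("select * from partitioned_table_tmp").show();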

Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?


2 Answers

0 votes

To read the partitioned table (created by Spark), you need to give the absolute path of the table, as below.

selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")

To filter it further, please try the below approach.

from pyspark.sql.functions import col
selected_Data.where(col("column_name") == 'col_value').show()
0 votes

This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query filters on a partition column that is not of STRING type.

For Spark, it is recommended to create tables with partition columns of STRING type.
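
For example, one way to follow that recommendation is to cast the partition column to string before writing. The snippet below is only a sketch reusing the Java writer code and names (dfWithSchema, somekey, PARQUET_PATH, tableName) from the question; an existing table with a non-string partition column would have to be dropped and recreated first.

// Cast the partition column to STRING before saving, so the metastore
// registers a string-typed partition key that Spark SQL can filter directly.
dfWithSchema
        .withColumn("somekey", dfWithSchema.col("somekey").cast("string"))
        .write()
        .partitionBy("somekey")
        .mode("append")
        .format("parquet")
        .option("path", PARQUET_PATH + tableName)
        .saveAsTable(tableName);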

If you are getting the below error message while filtering the Hive partitioned table in Spark:

Caused by: MetaException(message:Filtering is supported only on partition keys of type string)

recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.

Otherwise, you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:

selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
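
Either way, it helps to confirm how the partition column is actually defined in the metastore. A quick check, sketched here with the Java session from the question (use your own table name), is:

// The output includes a "# Partition Information" section that shows the
// partition column and its data type (string vs varchar).
session.sql("DESCRIBE TABLE " + tableName).show(100, false);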

However, I was not able to understand why it differentiates between the varchar and string datatypes for the partition column.