0 votes

I have a dataframe which I am inserting into an existing partitioned Hive table using Spark SQL (with dynamic partitioning). Once the dataframe has been written, I would like to know which partitions it has just created in Hive.

I could query the dataframe for its distinct partition values, but that takes a very long time because it has to re-run the dataframe's entire lineage.

I could persist the dataframe before writing to Hive, so that both the write and the distinct-on-partition-column operation run on top of the cached dataframe. But my dataframe is extremely large and I don't want to spend more time persisting it.
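For context, the persist-then-distinct approach described above would look roughly like the sketch below; the table name and the partition column name ("partition_date") are placeholders, not my actual schema.

df.persist()
df.write.insertInto("<db_name>.<tbl_name>")
// Distinct partition values now come from the cached data rather than the full lineage
val writtenPartitions = df.select("partition_date").distinct().collect()
df.unpersist()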

I know all the partition information is stored in the Hive Metastore. Are there any metastore APIs in Spark that could help retrieve only the new partitions that were created?

Which column have you partitioned the data on? Check this, it may help: stackoverflow.com/questions/36095790/… - vikrant rana
One of the date columns in the dataframe. - rasberry

2 Answers

0 votes

You can use HiveMetaStoreClient to retrieve partition data for a table:

import org.apache.hadoop.hive.conf.HiveConf
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

val hiveConf = new HiveConf(spark.sparkContext.hadoopConfiguration, classOf[HiveConf])
val cli = new HiveMetaStoreClient(hiveConf)

/* Get the list of partition values prior to the DF insert */
val existingPartitions = cli.listPartitions("<db_name>", "<tbl_name>", Short.MaxValue)
  .asScala.map(_.getValues.asScala.mkString(","))

/* Insert DF contents into the table */
df.write.insertInto("<db_name>.<tbl_name>")

/* Fetch the list of partition values again and diff it with the previous list */
val newPartitions = cli.listPartitions("<db_name>", "<tbl_name>", Short.MaxValue)
  .asScala.map(_.getValues.asScala.mkString(","))
val deltaPartitions = newPartitions.diff(existingPartitions)
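If you need the delta as column-name/value pairs rather than comma-joined strings, a small follow-up along these lines should work (it reuses the same cli; partitionKeys and deltaSpecs are just illustrative names):

/* Partition column names, in order, from the table definition */
val partitionKeys = cli.getTable("<db_name>", "<tbl_name>").getPartitionKeys.asScala.map(_.getName)
/* Pair each new partition's values back with the partition column names */
val deltaSpecs = deltaPartitions.map(values => partitionKeys.zip(values.split(",")).toMap)
/* Release the metastore connection when finished */
cli.close()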
0 votes

// Epoch time in milliseconds, captured before inserting the dataframe
val epochTime = System.currentTimeMillis()
val partitionName = "<partition_column_name>"   // name of the partition column

df.write.insertInto("<db_name>.<tbl_name>")

// Keep only the catalog partitions created or modified at/after epochTime
val catalogPartitions = spark.sharedState.externalCatalog.listPartitions("<db_name>", "<tbl_name>")
val processedPartitions = catalogPartitions.filter { cp =>
  val ddlTimeUpdated = cp.parameters.get("transient_lastDdlTime").exists(_.toLong >= epochTime / 1000)
  (ddlTimeUpdated || cp.lastAccessTime >= epochTime || cp.createTime >= epochTime) &&
    cp.spec.contains(partitionName)
}.map(_.spec.getOrElse(partitionName, "")).toList

lastAccessTime comes back as 0 in most cases. createTime holds the time when the partition was created. But in parameters I found another field, transient_lastDdlTime, that contains the updated timestamp of the partition. To be safe, I check all three to get the partitions that were created or modified after the given epoch time.
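To verify which of the three timestamps is actually populated in a given environment, a quick diagnostic sketch like the following (reusing catalogPartitions from above) prints them per partition:

// Print the timestamps available on each catalog partition
catalogPartitions.foreach { cp =>
  println(s"spec=${cp.spec} createTime=${cp.createTime} lastAccessTime=${cp.lastAccessTime} " +
    s"transient_lastDdlTime=${cp.parameters.get("transient_lastDdlTime")}")
}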