
I have my Spark conf set as:

sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")    
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")

I am using the Spark context to write parquet files to an HDFS location as:

df.write.partitionBy('asofdate').mode('append').parquet('parquet_path')

In the HDFS location the parquet files are stored partitioned by 'asofdate', but for the Hive table I have to run 'MSCK REPAIR TABLE <tbl_name>' every day. I am looking for a way to recover the table for every new partition from the Spark script itself (or at the time of partition creation).


1 Answer


It's better to integrate Hive with Spark to make your job easier.

After setting up the Hive-Spark integration, you can enable Hive support while creating the SparkSession.

  from pyspark.sql import SparkSession
  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

Now you can access Hive tables from Spark and run the repair command from Spark itself.

spark.sql("MSCK REPAIR TABLE <tbl_name>")

I would suggest writing the dataframe directly as a Hive table instead of writing it to parquet and then repairing the table.

df.write.partitionBy("<partition_column>").mode("append").format("parquet").saveAsTable("<table>")