I am reading CSV files from S3 and writing them into a Hive table as ORC. The write produces a lot of small files, and I need to merge them.
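For context, masterFile used below is built roughly like this (sketch only; the bucket/prefix are placeholders and the header option is an assumption about my CSVs):

// Sketch of the current read; spark is a Hive-enabled SparkSession.
val masterFile = spark.read
  .option("header", "true")   // assumption: the CSVs carry a header row
  .csv("s3a://<bucket>/<csv-prefix>/")

I have the following properties set: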
spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("set hive.merge.smallfiles.avgsize = 128000000")
spark.sql("set hive.merge.size.per.task = 128000000")
Apart from these configurations, I tried repartition(1) and coalesce(1), which do merge everything into a single file, but Overwrite mode deletes the Hive table and creates it again:
masterFile.repartition(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).orc(<HIVEtablePath>);
If I use Append mode instead of Overwrite, it creates duplicate files under each partition:
masterFile.repartition(1).write.mode(SaveMode.Append).partitionBy(<partitioncolumn>).orc(<HIVEtablePath>);
In both cases the Spark job runs twice and fails on the second execution.
Is there any way to use repartition/coalesce with Append mode without duplicating the part files in each partition?
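One direction I have been wondering about (untested, so this is only a sketch) is dynamic partition overwrite, which is available in Spark 2.3+ and, as far as I understand, makes Overwrite replace only the partitions present in the incoming DataFrame rather than the whole table:

// Assumption: Spark >= 2.3. With "dynamic" mode, Overwrite should replace
// only the partitions that actually appear in masterFile, leaving the rest
// of the table untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

masterFile
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("<partitioncolumn>")   // placeholder, as above
  .orc("<HIVEtablePath>")             // placeholder, as above

Would that avoid both the full-table delete and the Append duplicates, or is there a better pattern for this?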