
I am reading CSV files from S3 and writing them into a Hive table as ORC. The write produces a lot of small files, which I need to merge. I have the following properties set:

 spark.sql("SET hive.merge.sparkfiles = true")
 spark.sql("SET hive.merge.mapredfiles = true")
 spark.sql("SET hive.merge.mapfiles = true")
 spark.sql("set hive.merge.smallfiles.avgsize = 128000000")
 spark.sql("set hive.merge.size.per.task = 128000000")
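The same settings can also be supplied once when the session is built instead of via repeated `spark.sql("SET ...")` calls; a minimal sketch, assuming Spark 2.x with Hive support on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the Hive merge settings from above, applied at session build time.
val spark = SparkSession.builder()
  .appName("csv-to-orc")
  .enableHiveSupport()
  .config("hive.merge.sparkfiles", "true")
  .config("hive.merge.mapredfiles", "true")
  .config("hive.merge.mapfiles", "true")
  .config("hive.merge.smallfiles.avgsize", "128000000")
  .config("hive.merge.size.per.task", "128000000")
  .getOrCreate()
```

Note that these `hive.merge.*` properties are honored by Hive's own execution engines; Spark's native ORC datasource writer does not merge files based on them, which is why controlling the file count falls back to `repartition`/`coalesce`.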

Apart from these configurations, I tried repartition(1) and coalesce(1), which do merge the output into a single file, but they delete the Hive table and recreate it:

 masterFile.repartition(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).orc(<HIVEtablePath>);

If I use Append mode instead of Overwrite, it creates duplicate files under each partition:

  masterFile.repartition(1).write.mode(SaveMode.Append).partitionBy(<partitioncolumn>).orc(<HIVEtablePath>);

In both cases the Spark job runs twice and fails on the second execution.

Is there any way to use repartition/coalesce with Append mode without duplicating the part files in each partition?

Hi, did you find an answer for your question? I have the same problem. - astro_asz

1 Answer

masterFile.repartition(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).orc(<HIVEtablePath>)

The .orc() method writes the data files only and does not touch the metastore, so it cannot overwrite a table in Hive.

If you'd like to overwrite the data in a Hive table, use the method .insertInto(hive_table_name), where hive_table_name is the full name of the table in Hive (schema + table_name).

Applied to your example (note that .insertInto() cannot be combined with .partitionBy(); the table's existing partition columns are used automatically):

masterFile.repartition(1).write.mode(SaveMode.Overwrite).insertInto(hiveTableName)
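Be aware that with SaveMode.Overwrite, .insertInto() by default replaces all partitions of the table, not just the ones present in the incoming data. A sketch, assuming Spark 2.3+, that restricts the overwrite to the touched partitions:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: only the partitions present in `masterFile` are replaced;
// other partitions in the table are left intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

masterFile
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(hiveTableName)  // uses the table's existing partitioning
```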

It's also possible to overwrite the data together with the metadata. The method .saveAsTable(hive_table_name) with Overwrite mode will also update the table definition in the metastore.

masterFile.repartition(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).saveAsTable(hiveTableName)
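After either approach, it is worth confirming that repartition(1) actually produced a single file per partition. A sketch using Hadoop's FileSystem API; the partition path below is hypothetical and must be adjusted to your table layout:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: count the ORC part files under one partition directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partitionDir = new Path("<HIVEtablePath>/<partitioncolumn>=<value>")
val partFiles = fs.listStatus(partitionDir)
  .filter(_.getPath.getName.startsWith("part-"))
println(s"part files in partition: ${partFiles.length}")
```

If the count is greater than 1 after an Append write, the duplicates described in the question are still being produced and the Overwrite-based paths above are the safer option.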