I have the following Scala code that I use to write data from a JSON file to a table in Hive.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("App").setMaster("local")
import org.apache.spark.sql.hive._
val hiveContext = new HiveContext(sc)
val stg_comments = hiveContext.read.schema(buildSchema()).json(<path to json file>)
stg_comments.write.mode("append").saveAsTable(<table name>)
My JSON data has newline and carriage return characters in its field values, so I cannot simply insert the records into Hive (Hive tables by default do not preserve newlines and carriage returns in data values); that is why I use the saveAsTable option. The issue is that every time a JSON file is read and new records are appended to the existing table, a new Parquet file is created in the table directory under the Hive warehouse directory. This leads to lots of really small Parquet files in that directory. I would like the data to be appended to the existing Parquet file instead. Does anyone know how to do that? Thanks!
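For context, Parquet files are immutable, so Spark cannot literally append rows into an existing file; the usual workaround is to reduce the number of files written per batch and compact the table periodically. Below is a minimal sketch of the first part, assuming the same Spark 1.x HiveContext setup as above; the path "/data/comments.json" and table name "db.comments" are hypothetical placeholders, and buildSchema() is the same helper referenced in my code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("App").setMaster("local")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

val stg_comments = hiveContext.read.schema(buildSchema()).json("/data/comments.json")

// coalesce(1) collapses the batch into a single partition, so each append run
// writes only one Parquet file instead of one file per partition. Existing
// files are not touched; a separate compaction job (e.g. reading the table and
// rewriting it with overwrite mode) would still be needed to merge old files.
stg_comments.coalesce(1).write.mode("append").saveAsTable("db.comments")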