How to overwrite a partition in Apache Spark 2.3 while still writing to Parquet with the insertInto method

Question

I found this example code, which overwrites a partition in Spark 2.3 very neatly:

dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)

My issue is that even after adding .format("parquet"), the output files do not get a .parquet extension; they end in .c000 instead.

The compaction and the overwriting of the partition are working, but the files do not appear to be written as Parquet.

Full code here:

import org.apache.spark.sql.SparkSession
import org.joda.time.{DateTime, DateTimeZone}

// tblName, columnPartition, hardCodedPartition, coalesceNum and Helpers are defined elsewhere
val sparkSession = SparkSession.builder //.master("local[2]")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("parquet.compression", "snappy")
  .enableHiveSupport() // can just comment out hive support
  .getOrCreate()
sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")
val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
// to compact yesterday's partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong

val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")

if (dfPartition.take(1).nonEmpty) {
  // only the partitions present in dfPartition are overwritten
  sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

  dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
  sparkSession.sql(s"msck repair table $tblName")
  Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
} else {
  println("invalid partition")
}

The suggestion to use this code came from the question "Overwrite specific partitions in spark dataframe write method".

What I like about this method is that I do not have to list the partition columns, so I can easily reuse it in many cases.

Using Scala 2.11, CDH 5.12, Spark 2.3.

Any suggestions?

Madhava Carrillo · Accepted Answer · 2019-02-08T11:13:09

The .c000 suffix comes from the write task that produced the file (it is a file counter within that task), not from the actual file format. The file could be Parquet and end with .c000, or .snappy, or .zip... To find out the actual file format, run this command:

hadoop dfs -cat /tmp/filename.c000 | head

where /tmp/filename.c000 is the HDFS path to your file. The output will be mostly unreadable binary symbols, but if it is actually a Parquet file it will start with the magic bytes PAR1 (the same four bytes also close the file).
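
The same check can be done programmatically. Below is a minimal sketch, assuming the file has first been copied to the local filesystem (for example with hadoop fs -get); the object name ParquetMagicCheck and the method looksLikeParquet are hypothetical names chosen for illustration, not part of Spark or Hadoop. It only tests for the PAR1 marker at both ends of the file and does not validate the Parquet footer.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

// Hypothetical helper: checks whether a local file carries the Parquet magic bytes.
object ParquetMagicCheck {
  // Parquet files begin and end with the 4-byte ASCII marker "PAR1".
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  def looksLikeParquet(localPath: String): Boolean = {
    val file = new RandomAccessFile(localPath, "r")
    try {
      if (file.length() < 2 * Magic.length) {
        false // too short to hold both the leading and trailing marker
      } else {
        val head = new Array[Byte](Magic.length)
        val tail = new Array[Byte](Magic.length)
        file.readFully(head)                     // first 4 bytes
        file.seek(file.length() - Magic.length)  // jump to the last 4 bytes
        file.readFully(tail)
        head.sameElements(Magic) && tail.sameElements(Magic)
      }
    } finally {
      file.close()
    }
  }
}
```

Calling ParquetMagicCheck.looksLikeParquet("/tmp/filename.c000") on a local copy of the file returns true when both markers are present, regardless of the file's extension.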
