Is there a way to repartition an already partitioned dataset in order to reduce the number of files within a single partition efficiently, i.e. without shuffling? For example, suppose I have a dataset partitioned by some key:
key=1/
  part1
  ..
  partN
key=2/
  part1
  ..
  partN
..
key=M/
  part1
  ..
  partN
I can just do the following:
import org.apache.spark.sql.functions.col

spark.read
  .parquet("/input")
  .repartition(col("key"))   // repartition by the same column the data is already partitioned by
  .write
  .partitionBy("key")
  .parquet("/output")
I expect that all data belonging to a single partition should land on the same executor, but it seems to work differently and a lot of shuffling is involved. Am I doing something wrong here? The data is stored in Parquet and I'm using Spark 2.4.3.
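For reference, this is a minimal sketch of how I check whether a shuffle is actually planned (same /input path and the same read + repartition as in the job above); an Exchange operator in the printed physical plan is what I'm interpreting as the shuffle:

import org.apache.spark.sql.functions.col

// Print the physical plan for the read + repartition step.
// An "Exchange hashpartitioning(key, ...)" node indicates Spark plans a shuffle.
spark.read
  .parquet("/input")
  .repartition(col("key"))
  .explain()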