Is there a way to repartition an already partitioned dataset in order to reduce the number of files within a single partition efficiently, i.e. without shuffling? For example, suppose I have a dataset partitioned by some key:
key=1/
  part1
  ..
  partN
key=2/
  part1
  ..
  partN
..
key=M/
  part1
  ..
  partN
I can just do the following:
import org.apache.spark.sql.functions.col

spark.read
  .parquet("/input")
  .repartition(col("key"))   // repartition by the same column the data is already partitioned by
  .write
  .partitionBy("key")
  .parquet("/output")
I expect that all data belonging to a single partition should land on the same executor, but it seems to work differently and a lot of shuffling is involved. Am I doing something wrong here? The data is stored in Parquet and I'm using Spark 2.4.3.
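For reference, this is a minimal sketch of how I check whether a shuffle is actually planned (same /input path and the same read + repartition as in the job above); an Exchange operator in the printed physical plan is what I'm interpreting as the shuffle:

import org.apache.spark.sql.functions.col

// Print the physical plan for the read + repartition step.
// An "Exchange hashpartitioning(key, ...)" node indicates Spark plans a shuffle.
spark.read
  .parquet("/input")
  .repartition(col("key"))
  .explain()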