I have a Dataset that I'm writing out to S3 using the DataFrameWriter. I'm writing Parquet and also calling partitionBy on a column that has 256 distinct values. It works, but the dataset takes a while to write out (and to read into other jobs). While debugging, I noticed that the writer outputs only 256 files, one per suffix, even though my repartition call specifies 256 partitions. Is there a way to increase the number of files output for each partitionBy value?
My code looks like:
myDS = myDS.repartition(256, functions.col("suffix"));
myDS.write().partitionBy("suffix").parquet(String.format(this.outputPath, "parquet", this.date));
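For illustration, here is a minimal sketch, without Spark, of why repartitioning on the same column used in partitionBy yields one file per value, and how an extra "salt" column changes the count. Everything in it is an assumption made for the example: the suffix values are taken to be the two-digit hex strings "00" to "ff", `Objects.hash` stands in for Spark's hash partitioner (Spark uses a Murmur3-based hash, so real partition assignments differ), and `countFiles`, `saltBuckets`, and `rowsPerSuffix` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class PartitionFileCount {
    // Counts the files a partitionBy("suffix") write would produce under this model:
    // each shuffle partition writes one file per distinct suffix value it holds.
    static int countFiles(int numPartitions, int saltBuckets, int rowsPerSuffix) {
        Set<String> files = new HashSet<>();
        for (int i = 0; i < 256; i++) {
            String suffix = String.format("%02x", i); // assumed suffix values "00".."ff"
            for (int row = 0; row < rowsPerSuffix; row++) {
                // Deterministic stand-in for a random salt in [0, saltBuckets).
                int salt = row % saltBuckets;
                // Stand-in for hash partitioning on (suffix, salt); not Spark's real hash.
                int partition = Math.floorMod(Objects.hash(suffix, salt), numPartitions);
                files.add(partition + "/" + suffix);
            }
        }
        return files.size();
    }

    public static void main(String[] args) {
        // Hashing on suffix alone: every row of a suffix lands in one partition.
        System.out.println(countFiles(256, 1, 8));  // prints 256
        // Hashing on (suffix, salt) with 4 salt values spreads each suffix out.
        System.out.println(countFiles(1024, 4, 8)); // prints 1024
    }
}
```

In Spark terms, this would roughly correspond to adding a salt column and calling something like `repartition(1024, functions.col("suffix"), functions.col("salt"))` before the write. With Spark's real hash, salted rows of one suffix can collide in the same partition, so four salt values give at most four files per suffix rather than exactly four.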