Change number of output files using DataFrameWriter in Spark

Question

I have a Dataset that I'm writing out to S3 using the DataFrameWriter. I'm using Parquet and also doing a partitionBy call on a column that has 256 distinct values. It works well but takes some time to write the dataset out (and read into other jobs). In debugging, I noticed that the writer only outputs 256 files, one per suffix, despite my repartition call specifying 256 partitions. Is there a way to increase the number of files output for each partitionBy value?

My code looks like:

myDS = myDS.repartition(256, functions.col("suffix"));
myDS.write().partitionBy("suffix").parquet(String.format(this.outputPath, "parquet", this.date));

cbrown cbrown · Accepted Answer · 2016-12-07T15:35:20

The issue with my code was the presence of specifying a column in my repartition call. Simply removing the column from the repartition call fixed the issue.

The relationship between number of output files per partitionBy value is directly related to the number of partitions. Suppose you have 256 distinct partitionBy values. If you precede your writer with a repartition(5) call, you'll end up with a maximum of 5 output files per partitionBy value. Total number of output files would not exceed 1280 (though it could be less if a there is not much data for a given partitionBy value).

Change number of output files using DataFrameWriter in Spark

1 Answers