Using Spark 2.3 on EMR, I'm doing ETL in Scala and writing temporary results with dataframe.write.partitionBy("column1").parquet("location").
I then read the temporary data into a new DataFrame and add another column to the data set. I write the final results with the call below; bucketBy and sortBy are there to improve the performance of queries on "column2", which is commonly used for joins and other filters:
newdataframe.write.partitionBy("column1").bucketBy(1, "column2").sortBy("column2").option("path", "location").saveAsTable("tablename")
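For reference, here is a minimal sketch of the two stages (the source read, the added column, and the concrete paths are placeholders; the partitioning/bucketing calls are exactly as above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

// Stage 1: write temporary results as Parquet, partitioned by column1.
val dataframe = spark.read.parquet("s3://bucket/raw/")      // placeholder source
dataframe.write
  .partitionBy("column1")
  .parquet("s3://bucket/tmp/")                              // "location" above

// Stage 2: read the temporary data back, add the extra column, and write the
// final table, bucketed and sorted on column2.
val newdataframe = spark.read.parquet("s3://bucket/tmp/")
  .withColumn("extra_col", lit("derived"))                  // placeholder for the added column

newdataframe.write
  .partitionBy("column1")
  .bucketBy(1, "column2")
  .sortBy("column2")
  .option("path", "s3://bucket/final/")                     // path of the final table
  .saveAsTable("tablename")
```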
The first write gives me 200 part files per partition, each about 770 MB. The second gives me 200 part files per partition, each about 192 MB.
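In case it matters, the per-file sizes come from listing the part files under a single partition directory, roughly like this (same Spark session as above; the path and partition value are placeholders):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// List the part files of one partition and print their sizes in MB.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partitionDir = new Path("s3://bucket/tmp/column1=someValue/")
fs.listStatus(partitionDir)
  .filter(_.getPath.getName.startsWith("part-"))
  .foreach(f => println(f"${f.getPath.getName}%s  ${f.getLen / (1024.0 * 1024.0)}%.1f MB"))
```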
Both data sets produce the same metrics (sum of column4) and almost the same number of rows (< 0.1% difference).
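The comparison between the two outputs was a simple aggregation and count, roughly along these lines (same placeholder paths and table name as above):

```scala
import org.apache.spark.sql.functions.sum

// Compare sum(column4) and row counts between the temporary and final data.
val tmp = spark.read.parquet("s3://bucket/tmp/")
val fin = spark.table("tablename")

tmp.agg(sum("column4")).show()
fin.agg(sum("column4")).show()
println(s"rows: tmp=${tmp.count()} final=${fin.count()}")
```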
Why is the second result so much smaller than the first, given that both use the same Parquet format and the same partition column, and the second data set even has one more column?
Appreciate any help.