I recently performed ETL on a dataset using Spark 2.3.0 on EMR 5.19, where I included a new sorting column. I used the following to do this and noticed that the output was much bigger than the original dataset (both compressed Parquet):
spark.sql("select * from schema.table where column = 'value'").write.bucketBy(1, "column1").sortBy("column2", "column3").option("path", "/mypath").saveAsTable("table")
I then reran this using the method below and got the expected data size (same as the original):
spark.read.load("/originaldata").filter("column = 'value'").write.bucketBy(1, "column1").sortBy("column2", "column3").option("path", "/mypath").saveAsTable("table")
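In case it helps narrow things down, here is a minimal sketch of how the two read paths could be compared via their physical plans (the table name, path, and column are the same placeholders as above):

# Minimal sketch (PySpark) of comparing the physical plans of the two read
# paths; "schema.table", "/originaldata", and "column" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

sql_df = spark.sql("select * from schema.table where column = 'value'")
file_df = spark.read.load("/originaldata").filter("column = 'value'")

sql_df.explain(True)   # scan goes through the metastore table definition
file_df.explain(True)  # direct file scan of the original Parquet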
My write method is identical; only the way I'm bringing the data in is different. So why is the first result about 4x bigger than the second? Am I not doing the exact same thing either way? I tried to look up the differences between Spark SQL and RDDs but couldn't find anything specifically about writing the data. Note that the original dataset and both results are all partitioned the same way (200 parts in all 3).
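For reference, this is roughly the kind of sanity check I mean by "200 parts in all 3" (the output paths below are placeholders for where each run wrote its data):

# Minimal sketch of the sanity check behind the "200 parts" note; the two
# output paths are placeholders for the locations of the two results.
from pyspark.sql.functions import input_file_name

out_sql = spark.read.load("/mypath_sql_version")
out_file = spark.read.load("/mypath_read_version")

print(out_sql.count(), out_file.count())  # same row count in both outputs
print(out_sql.select(input_file_name()).distinct().count(),
      out_file.select(input_file_name()).distinct().count())  # number of files in each output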
Note: I also add mode("overwrite") to the write, just to make sure there's no appending happening.
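For completeness, the write side of both runs, with the overwrite mode included, looks roughly like this (df stands for either of the two DataFrames above; the path, table name, and column names are placeholders):

# Rough shape of the write used in both runs, with overwrite mode included;
# df is whichever DataFrame (spark.sql or spark.read) is being written.
(df.write
   .mode("overwrite")
   .bucketBy(1, "column1")
   .sortBy("column2", "column3")
   .option("path", "/mypath")
   .saveAsTable("table"))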