
I recently performed ETL on a dataset using Spark 2.3.0 on EMR 5.19, adding a new sorting column. I used the following to do this and noticed that the output was much bigger than the original dataset (both compressed Parquet).

spark.sql("select * from schema.table where column = 'value'").write.bucketBy(1, "column1").sortBy("column2", "column3").option("path", "/mypath").saveAsTable("table")

I then reran this using the method below and got the expected data size (the same as the original).

spark.read.load("/originaldata").filter("column = 'value'").write.bucketBy(1, "column1").sortBy("column2", "column3").option("path", "/mypath").saveAsTable("table")

My write method is identical, but the way I'm bringing the data in is different. So why is the first result about 4x bigger than the second? Am I not doing the exact same thing either way? I tried to look up the differences between Spark SQL and RDDs but couldn't find anything specifically about writing the data. Note that the original dataset and both results are partitioned the same way (200 parts in all three).
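
To dig into the size difference, here is a minimal diagnostic sketch (assuming Scala in spark-shell; the output paths below are placeholders, not the real ones) that compares the row counts and on-disk sizes of the two results:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output paths for the two runs
val outFromSql  = "/mypath_from_spark_sql"   // written by the spark.sql(...) version
val outFromRead = "/mypath_from_spark_read"  // written by the spark.read.load(...) version

// Row counts should match if both runs selected the same rows
println(spark.read.load(outFromSql).count())
println(spark.read.load(outFromRead).count())

// Total bytes on disk under each output directory
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
Seq(outFromSql, outFromRead).foreach { p =>
  println(p + " -> " + fs.getContentSummary(new Path(p)).getLength + " bytes")
}

If the counts match but the byte totals differ by roughly 4x, the difference is in the file layout and compression rather than in the rows being written.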

How does RDD figure here? - thebluephantom
Did you do the counts? Not sure if the paths are the same for both runs or if it is just appending the data, @Ilya P - Sai
I'm using different paths for each approach, and there's a mode("overwrite") just to make sure no appending is happening. - Ilya P
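
For reference, a sketch of what that write looks like with mode("overwrite") in place (same placeholder table, columns, and path as above; shown in Scala):

// Sketch only: the first write from the question, with mode("overwrite")
// added so each run replaces any previous output instead of appending to it
spark.sql("select * from schema.table where column = 'value'")
  .write
  .mode("overwrite")
  .bucketBy(1, "column1")
  .sortBy("column2", "column3")
  .option("path", "/mypath")
  .saveAsTable("table")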

1 Answer


After getting the same larger-than-expected result with both of these approaches, I switched to this instead:

spark.read.load("/originaldata").filter("column = 'value'").sort("column1", "column2").write.save("/location")

This works as expected and does not fail. It also avoids the unnecessary Hive saveAsTable machinery, and it's a better option than sortBy, which also requires bucketBy and saveAsTable.
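
For completeness, a sketch of the same write spelled out with mode("overwrite") and an explicit output format (assuming Scala; Parquet is Spark 2.3's default source, and the compression option shown is just the standard DataFrameWriter option with its default value, not something required here):

// Sketch only: the answer's approach with the mode, format, and compression
// codec made explicit ("snappy" is Spark's default codec for Parquet)
spark.read.load("/originaldata")
  .filter("column = 'value'")
  .sort("column1", "column2")
  .write
  .mode("overwrite")
  .format("parquet")
  .option("compression", "snappy")
  .save("/location")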