I ran into an issue where performing a full outer join with an empty table in Spark SQL apparently results in a much larger output file than simply selecting the same columns from the non-empty dataset without doing a join.
Basically, I had two datasets, one very large and one empty. I selected nearly all of the columns from the large dataset, full-outer-joined it to the empty dataset, and wrote the result out as snappy-compressed parquet (I also tried snappy-compressed orc). Alternatively, I simply selected the same columns from the large dataset and saved the result as snappy-compressed parquet (or orc) as above. The file sizes were drastically different: the file from the empty-dataset join was nearly five times bigger than the simple-select file.
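For illustration, this is roughly what the two paths looked like (the dataset names, the column list, the join key, and the output paths below are placeholders, not my actual code):

```scala
// Rough sketch of the two write paths; largeDF, emptyDF, the columns,
// the join key, and the output paths are all placeholders.
val selected = largeDF.select("col1", "col2", "col3") // nearly all columns

// Path 1: full outer join with the empty dataset, then write
selected
  .join(emptyDF, Seq("col1"), "full_outer")
  .write
  .option("compression", "snappy")
  .parquet("/tmp/with_empty_join") // ends up ~5x larger

// Path 2: plain select, then write
selected
  .write
  .option("compression", "snappy")
  .parquet("/tmp/plain_select")
```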
I've tried this on a number of different datasets and get the same results. In looking at the data:
- The number of output rows is the same (verified with spark-shell by reading the output datasets back in and doing a count; see the sketch after this list)
- Schemas are the same (verified with spark-shell, parquet-tools, and orc-tools)
- Spot-checking shows the data looks the same, and I don't see anything unusual in either type of output
- I explicitly saved all files with the same snappy compression, and Spark gives the output files '.snappy.parquet' extensions
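For reference, the spark-shell checks for the first two bullets were along these lines (the paths are placeholders for my actual output locations):

```scala
// Read both outputs back and compare row counts and schemas.
val joined = spark.read.parquet("/tmp/with_empty_join")
val plain  = spark.read.parquet("/tmp/plain_select")

println(joined.count() == plain.count()) // same number of rows
println(joined.schema == plain.schema)   // same schema
```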
I understand that joining with an empty table is effectively pointless (I was doing it as part of some generic code that always performed a full outer join and sometimes encountered empty datasets). I've since updated my code so it no longer does this, so the immediate problem is fixed.
Still, I would like to understand why and how this could be happening. So my question is: why would doing a Spark SQL join with an empty dataset result in a larger file size? Any ideas about how to figure out what is making the resulting parquet files so large would also be helpful.
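For anyone digging in, this is a minimal sketch of the kind of per-column footer inspection I can run from spark-shell to compare the two outputs (the part-file path is a placeholder, and it assumes the parquet-hadoop classes that ship with Spark):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Placeholder path to one part file from either output directory
val footer = ParquetFileReader.readFooter(
  spark.sparkContext.hadoopConfiguration,
  new Path("/tmp/with_empty_join/part-00000.snappy.parquet"))

// Print compressed/uncompressed sizes and encodings per column chunk
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { col =>
    println(s"${col.getPath} compressed=${col.getTotalSize} " +
      s"uncompressed=${col.getTotalUncompressedSize} encodings=${col.getEncodings}")
  }
}
```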