
I ran into an issue where performing a full outer join with an empty table in Spark SQL apparently results in a much larger output file than simply selecting the same columns from the other dataset without the join.

Basically, I had two datasets, one very large and the other empty. I selected nearly all columns from the large dataset and full-outer-joined it to the empty dataset, then wrote the result out as snappy-compressed parquet (I also tried snappy-compressed orc). For comparison, I simply selected the same columns from the large dataset and saved the result as snappy-compressed parquet (or orc) as above. The file sizes were drastically different: the file from the empty-dataset join was nearly five times bigger than the simple-select file.
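
To make the comparison concrete, the two code paths looked roughly like this (a simplified sketch, not my actual code; the paths, dataset names, and column names are placeholders):

    // Simplified sketch of the two code paths (paths and column names are placeholders).
    // Run in spark-shell, where `spark` is already defined.
    import org.apache.spark.sql.functions.col

    val largeDf = spark.read.parquet("/path/to/large")   // very large dataset
    val emptyDf = spark.read.parquet("/path/to/empty")   // zero rows, shares the key column

    val cols = Seq("id", "colA", "colB").map(col)        // in reality, nearly all columns

    // Path 1: select, full outer join with the empty dataset, then write.
    largeDf.select(cols: _*)
      .join(emptyDf, Seq("id"), "full_outer")
      .write.option("compression", "snappy").parquet("/path/to/output_joined")

    // Path 2: the same select, written directly.
    largeDf.select(cols: _*)
      .write.option("compression", "snappy").parquet("/path/to/output_select")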

I've tried this on a number of different datasets and get the same results. Looking at the data (the spark-shell checks are sketched after this list):

  • The number of output rows is the same (verified with spark-shell by reading in the output datasets and doing a count)
  • Schemas are the same (verified with spark-shell, parquet-tools, and orc-tools)
  • Spot-checking the data shows the same values, and I don't see anything strange in either type of output
  • I explicitly saved all files with the same (snappy) compression, and Spark gives the output files '.snappy.parquet' extensions
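
For reference, the spark-shell checks were roughly the following (a sketch; paths are placeholders):

    // Sketch of the spark-shell checks (paths are placeholders).
    val joined   = spark.read.parquet("/path/to/output_joined")
    val selected = spark.read.parquet("/path/to/output_select")

    // Same number of output rows
    joined.count()
    selected.count()

    // Same schema
    joined.schema == selected.schema

    // Spot-check the data
    joined.show(20, truncate = false)
    selected.show(20, truncate = false)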

I understand that doing a join with an empty table is effectively pointless (I was doing it as part of some generic code that always performed a full outer join and sometimes encountered empty datasets). I've since updated my code so it no longer does this, so the immediate problem is fixed.

Still, I would like to understand why and how this could be happening. So my question is: why would doing a Spark SQL join with an empty dataset result in a larger file size? Any ideas about how to figure out what is making the resulting parquet files so large would also be helpful.


1 Answer


After coming across several other situations where seemingly small differences in the way data was processed resulted in a big difference in file size (for what appeared to be exactly the same data), I finally figured this out.

The key here is understanding how parquet (or orc) encodes data using techniques such as dictionary encoding and run-length encoding. These encodings take advantage of redundancy in the data to make the file size smaller; that is, if your data contains lots of similar values, the file size will be smaller than if it contains many distinct values.
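
You can see this effect directly with a toy example that writes identical data with and without locality (an illustrative sketch, not from the original job; paths are placeholders):

    // Illustrative sketch: identical data, written with and without locality.
    // Run in spark-shell; output paths are placeholders.
    import org.apache.spark.sql.functions.{col, rand}

    val df = spark.range(10L * 1000 * 1000)
      .withColumn("category", (col("id") % 100).cast("string"))

    // Sorted: long runs of identical "category" values per file, so dictionary
    // and run-length encoding (plus snappy) work well and the files are small.
    df.orderBy("category")
      .write.option("compression", "snappy").parquet("/tmp/sorted")

    // Shuffled: the same values interleaved, so there is far less redundancy
    // within each page and the files come out noticeably larger.
    df.orderBy(rand())
      .write.option("compression", "snappy").parquet("/tmp/shuffled")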

In the case of joining with an empty dataset, the important point is that when Spark does a (shuffle) join, it repartitions the data on the join column. So doing a join, even with an empty dataset, may change the way the data is partitioned.
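
One way to see the repartitioning is to tag rows with their partition id before and after the join (a sketch; the dataset names and the join key "id" are placeholders):

    // Sketch: see how a join redistributes rows across partitions.
    // Dataset names and the join key "id" are placeholders.
    import org.apache.spark.sql.functions.spark_partition_id

    val bigDf   = spark.read.parquet("/path/to/large")
    val emptyDf = spark.read.parquet("/path/to/empty")

    // Before the join: partitioning reflects how the source files were read.
    bigDf.groupBy(spark_partition_id().as("pid")).count().orderBy("pid").show()

    // After the join: rows have been shuffled by the join key, so records that
    // used to sit together in one partition may now be scattered across many.
    bigDf.join(emptyDf, Seq("id"), "full_outer")
      .groupBy(spark_partition_id().as("pid")).count().orderBy("pid").show()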

In my case, joining with the empty dataset changed the partitioning from one where many similar records were grouped together in each partition to one where each partition held a mix of dissimilar records. When these partitions were written out, each file contained whatever records its partition happened to hold. Where a partition held similar records, parquet's encoding was very efficient and the files were small; where a partition was diverse, parquet's encoding couldn't be as efficient and the files were larger, even though the data overall was exactly the same.
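
If you do end up in this situation, one workaround is to restore the locality just before writing, for example by repartitioning and/or sorting within partitions on whichever column(s) group similar records (a sketch; "category" is a placeholder for that column):

    // Sketch: restore locality before the write so parquet's encodings can
    // exploit redundancy again. "category" is a placeholder for whatever
    // column(s) group similar records together.
    import org.apache.spark.sql.functions.col

    val joined = spark.read.parquet("/path/to/output_joined")  // or the joined DataFrame itself

    joined
      .repartition(col("category"))           // put similar records in the same partition
      .sortWithinPartitions("category")       // and next to each other within it
      .write.option("compression", "snappy")
      .parquet("/path/to/output_joined_relocalized")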

As mentioned above, we ran into several other instances where the same problem showed up: we'd change one aspect of processing, get the same output data, and the file size might be four times larger. One thing that helped in figuring this out was using parquet-tools to look at low-level file statistics.