I inspected the output Parquet file of a Spark job that keeps failing with out-of-memory errors.
I am using Spark 1.6.0 on Cloudera 5.13.1.
I noticed that the Parquet row group sizes are uneven: the first and the last row groups are huge, while the rest are really small.
Shortened output from parquet-tools (RC = row count, TS = total size):
row group 1: RC:5740100 TS:566954562 OFFSET:4
row group 2: RC:33769 TS:2904145 OFFSET:117971092
row group 3: RC:31822 TS:2772650 OFFSET:118905225
row group 4: RC:29854 TS:2704127 OFFSET:119793188
row group 5: RC:28050 TS:2356729 OFFSET:120660675
row group 6: RC:26507 TS:2111983 OFFSET:121406541
row group 7: RC:25143 TS:1967731 OFFSET:122069351
row group 8: RC:23876 TS:1991238 OFFSET:122682160
row group 9: RC:22584 TS:2069463 OFFSET:123303246
row group 10: RC:21225 TS:1955748 OFFSET:123960700
row group 11: RC:19960 TS:1931889 OFFSET:124575333
row group 12: RC:18806 TS:1725871 OFFSET:125132862
row group 13: RC:17719 TS:1653309 OFFSET:125668057
row group 14: RC:1617743 TS:157973949 OFFSET:134217728
Is this a known bug? How can I set the parquet block size (row group size) in Spark?
EDIT:
What the Spark application does: it reads a big Avro file, distributes the rows by two partition keys (using distribute by <part_keys> in the select), and then writes a Parquet file for each partition using DF.write.partitionBy(<part_keys>).parquet(<path>). A rough sketch of the pipeline is shown below.
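For clarity, this is a minimal sketch of that pipeline in Spark 1.6 Scala. All path, table, and column names (big_file.avro, input_table, part_key1, part_key2, the output path) are placeholders, and I assume the spark-avro package is used to read the Avro input; the real job differs in details.

// Minimal sketch of the job described above (Spark 1.6, Scala).
// Names and paths are placeholders; spark-avro is assumed on the classpath.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AvroToPartitionedParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-to-parquet"))
    val sqlContext = new HiveContext(sc)

    // Read the big Avro input file (via the spark-avro data source).
    val avroDF = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/input/big_file.avro")
    avroDF.registerTempTable("input_table")

    // DISTRIBUTE BY the two partition keys so that all rows with the same
    // key combination end up in the same task before the write.
    val distributed = sqlContext.sql(
      "SELECT * FROM input_table DISTRIBUTE BY part_key1, part_key2")

    // Write one Parquet directory per partition-key combination.
    distributed.write
      .partitionBy("part_key1", "part_key2")
      .parquet("/data/output/parquet_table")
  }
}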