I have the following scenario where I read a Parquet file using Spark:
Number of Parquet files: 1
Number of blocks (row groups) within the file: 3
Size of each block (row group) is as follows:
blockSize: 195 MB, rowCount: 1395661, compressedSize: 36107 bytes
blockSize: 295 MB, rowCount: 1538519, compressedSize: 38819 bytes
blockSize: 13 MB, rowCount: 52945, compressedSize: 1973 bytes
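For what it's worth, these per-row-group stats can be inspected straight from the Parquet footer; below is a rough sketch using the parquet-hadoop API (assuming parquet-hadoop is on the classpath and that path is the same file path used in the read above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Open the footer and print size/row-count info for each row group (block)
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), new Configuration()))
try {
  reader.getFooter.getBlocks.asScala.foreach { block =>
    println(s"blockSize: ${block.getTotalByteSize}, rowCount: ${block.getRowCount}, compressedSize: ${block.getCompressedSize}")
  }
} finally {
  reader.close()
}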
When I try to read this single Parquet file using Spark, it creates only one partition. Below is the code:
val df = sqlContext.read.parquet(path)
println(df.rdd.getNumPartitions) // result is 1
The Parquet block size is configured as parquet.block.size = 128 MB.
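As far as I understand, this setting is picked up by the Parquet writer from the Hadoop configuration; a minimal sketch of how it would be set (outputPath is a placeholder):

// parquet.block.size is read by the Parquet writer from the Hadoop configuration;
// 128 MB expressed in bytes. "outputPath" is a placeholder.
sc.hadoopConfiguration.set("parquet.block.size", (128 * 1024 * 1024).toString)
df.write.parquet(outputPath)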
My understanding is that, during a read, Hadoop maps one Parquet block (row group) to one HDFS block, so in this example the file should map to three HDFS blocks. I was therefore expecting Spark to create 3 partitions when reading it, but it created only 1. My guess is that Spark decides the number of partitions based on the Parquet file's size on disk (i.e. the compressed size) and NOT based on the blocks/row groups within the file.
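From what I can tell by skimming the Spark 2.x file-source planning code (so this is my assumption, not documented behaviour), the split size is derived from the on-disk (compressed) file length roughly as in the sketch below; the defaultParallelism and file-length values are illustrative, with ~77 KB being the sum of the compressed sizes listed above:

// Rough sketch of the split-size arithmetic (not Spark's actual code)
// assumed defaults: spark.sql.files.maxPartitionBytes = 128 MB,
//                   spark.sql.files.openCostInBytes   = 4 MB
val maxPartitionBytes  = 128L * 1024 * 1024
val openCostInBytes    = 4L * 1024 * 1024
val defaultParallelism = 8L                       // illustrative; depends on the cluster
val fileLengths        = Seq(77000L)              // one file, ~77 KB compressed on disk

val totalBytes    = fileLengths.map(_ + openCostInBytes).sum
val bytesPerCore  = totalBytes / defaultParallelism
val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

// Each file is chopped into byte ranges of at most maxSplitBytes; a ~77 KB file
// fits entirely in one range, hence 1 partition, regardless of its row-group count.
println(maxSplitBytes)

If that reading is correct, lowering spark.sql.files.maxPartitionBytes should produce more partitions for the same file.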
My question is: why does Spark NOT partition the data based on the number and size of the blocks (row groups) within the Parquet file, and instead appear to partition based on the file's compressed size on disk?