I have the following scenario where I read a Parquet file using Spark:
Number of Parquet files: 1
Number of blocks (row groups) within the file: 3
Size of each block (row group) is as follows:
blockSize: 195 MB, rowCount: 1395661, compressedSize: 36107 bytes
blockSize: 295 MB, rowCount: 1538519, compressedSize: 38819 bytes
blockSize: 13 MB, rowCount: 52945, compressedSize: 1973 bytes
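For what it's worth, these per-row-group stats can be inspected straight from the Parquet footer; below is a rough sketch using the parquet-hadoop API (assuming parquet-hadoop is on the classpath and that path is the same file path used in the read above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Open the footer and print size/row-count info for each row group (block)
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), new Configuration()))
try {
  reader.getFooter.getBlocks.asScala.foreach { block =>
    println(s"blockSize: ${block.getTotalByteSize}, rowCount: ${block.getRowCount}, compressedSize: ${block.getCompressedSize}")
  }
} finally {
  reader.close()
}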
When I try to read this single Parquet file using Spark, it creates only one partition. Below is the code:
val df = sqlContext.read.parquet(path)
println(df.rdd.getNumPartitions) // result is 1
The Parquet block size is configured as parquet.block.size = 128 MB.
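As far as I understand, this setting is picked up by the Parquet writer from the Hadoop configuration; a minimal sketch of how it would be set (outputPath is a placeholder):

// parquet.block.size is read by the Parquet writer from the Hadoop configuration;
// 128 MB expressed in bytes. "outputPath" is a placeholder.
sc.hadoopConfiguration.set("parquet.block.size", (128 * 1024 * 1024).toString)
df.write.parquet(outputPath)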
My understanding is that, during a read, Hadoop maps one Parquet block (row group) to one HDFS block, so in this example the file should map to three HDFS blocks. I was therefore expecting Spark to create 3 partitions when reading it, but it created only 1. My guess is that Spark decides the number of partitions based on the Parquet file's size on disk (i.e. the compressed size) and NOT based on the blocks/row groups within the file.
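From what I can tell by skimming the Spark 2.x file-source planning code (so this is my assumption, not documented behaviour), the split size is derived from the on-disk (compressed) file length roughly as in the sketch below; the defaultParallelism and file-length values are illustrative, with ~77 KB being the sum of the compressed sizes listed above:

// Rough sketch of the split-size arithmetic (not Spark's actual code)
// assumed defaults: spark.sql.files.maxPartitionBytes = 128 MB,
//                   spark.sql.files.openCostInBytes   = 4 MB
val maxPartitionBytes  = 128L * 1024 * 1024
val openCostInBytes    = 4L * 1024 * 1024
val defaultParallelism = 8L                       // illustrative; depends on the cluster
val fileLengths        = Seq(77000L)              // one file, ~77 KB compressed on disk

val totalBytes    = fileLengths.map(_ + openCostInBytes).sum
val bytesPerCore  = totalBytes / defaultParallelism
val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

// Each file is chopped into byte ranges of at most maxSplitBytes; a ~77 KB file
// fits entirely in one range, hence 1 partition, regardless of its row-group count.
println(maxSplitBytes)

If that reading is correct, lowering spark.sql.files.maxPartitionBytes should produce more partitions for the same file.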
My question is: why does Spark NOT partition the data based on the number and size of the blocks (row groups) within the Parquet file, and instead appear to partition based on the file's compressed size on disk?