
I have a parquet dataset with 506 partitions. Its size is 6.8 GB.

If I simply read it with spark.read.parquet(<file>), I get 150 partitions.

I know that I can set spark.sql.files.maxPartitionBytes (SPARK-17998).

But even when I set the value to 1G, it is still read as 150 partitions.
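For reference, here is a minimal sketch of how I apply the setting before the read (the path and app name are placeholders, not my actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("maxPartitionBytes-demo") // placeholder app name
      .getOrCreate()

    // Raise the maximum bytes packed into a single input partition (default 128MB).
    // The config has to be set before the DataFrame is read/planned.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "1g")

    // "/path/to/parquet" is a placeholder for the actual dataset path.
    val df = spark.read.parquet("/path/to/parquet")
    println(df.rdd.getNumPartitions) // still reports 150 for me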

My questions:

  1. How can I read the parquet with fewer partitions (e.g. partitionNum = 5), without using coalesce/repartition?
  2. Where does the number 150 come from? 50G / 128M = 400, not 150.

My Environment

  • Spark 3.0.1
  • 128 cores

1 Answer


To your questions:

  1. Read the parquet and use df.coalesce() (see the sketch after this list).
  2. Take a look at the spark.sql.shuffle.partitions option.
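A minimal sketch of both suggestions (the path and the target count of 5 are placeholders taken from the question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // "/path/to/parquet" is a placeholder for the dataset path.
    val df = spark.read.parquet("/path/to/parquet")

    // 1. coalesce() merges the existing read partitions into fewer ones
    //    without triggering a full shuffle.
    val smaller = df.coalesce(5)
    println(smaller.rdd.getNumPartitions) // 5

    // 2. spark.sql.shuffle.partitions controls how many partitions are used
    //    after shuffle operations such as joins and aggregations (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", "5")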

If you want further information, see the source.