I have a Parquet dataset with 506 partitions; its total size is 6.8 GB.
If I simply call `spark.read.parquet(<file>)`, I get 150 partitions.
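
Roughly what I run (PySpark here; the path is just a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the whole Parquet dataset and let Spark pick the split count.
df = spark.read.parquet("/path/to/parquet")  # placeholder path

# Prints 150 for me, not 506 and not ~400.
print(df.rdd.getNumPartitions())
```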
I know that I can set `spark.sql.files.maxPartitionBytes` (SPARK-17998). But even when I set the value to 1g, the data is still read as 150 partitions.
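
What I tried, roughly (again PySpark, placeholder path; I set the config before the read):

```python
# Raise the max bytes packed into one input partition to 1 GiB
# (spark.sql.files.maxPartitionBytes defaults to 128MB, see SPARK-17998).
spark.conf.set("spark.sql.files.maxPartitionBytes", "1g")

df = spark.read.parquet("/path/to/parquet")  # placeholder path
print(df.rdd.getNumPartitions())  # still 150
```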
My questions:

- How can I read the Parquet data into fewer partitions (e.g. partitionNum = 5) without using `coalesce`/`repartition`?
- Where does the number 150 come from? 50G / 128M = 400, not 150.
My environment:
- Spark 3.0.1
- 128 cores