
I have a parquet dataset with 506 partitions. Its size is 6.8 GB.

If I simply read it with spark.read.parquet(<file>), I get 150 partitions.

I know that I can set spark.sql.files.maxPartitionBytes (SPARK-17998).

But even when I set the value to 1G, it is still read as 150 partitions.
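For reference, here is a minimal sketch of how I apply the setting before the read (the path and app name are placeholders, not my actual job):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("maxPartitionBytes-demo") // placeholder app name
      .getOrCreate()

    // Raise the maximum bytes packed into a single input partition (default 128MB).
    // The config has to be set before the DataFrame is read/planned.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "1g")

    // "/path/to/parquet" is a placeholder for the actual dataset path.
    val df = spark.read.parquet("/path/to/parquet")
    println(df.rdd.getNumPartitions) // still reports 150 for me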

My questions:

  1. How can I read the parquet with fewer partitions (e.g. partitionNum = 5), without using coalesce/repartition?
  2. Where does the number 150 come from? 50G / 128M = 400, not 150.

My Environment

  • Spark 3.0.1
  • 128 cores

1 Answer


To your questions:

  1. Read the parquet and use df.coalesce() (see the sketch after this list).
  2. Take a look at the spark.sql.shuffle.partitions option.
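A minimal sketch of both suggestions (the path and the target count of 5 are placeholders taken from the question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // "/path/to/parquet" is a placeholder for the dataset path.
    val df = spark.read.parquet("/path/to/parquet")

    // 1. coalesce() merges the existing read partitions into fewer ones
    //    without triggering a full shuffle.
    val smaller = df.coalesce(5)
    println(smaller.rdd.getNumPartitions) // 5

    // 2. spark.sql.shuffle.partitions controls how many partitions are used
    //    after shuffle operations such as joins and aggregations (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", "5")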

If you want further information, see the source.