0
votes

I have 271 small parquet files (9 KB each) under the same directory in an S3 bucket. I'm trying to understand how Spark determines the number of tasks when it reads those files.

The cluster is AWS EMR 5.29, and my Spark conf has --num-executors 2 and --executor-cores 2.

When I run spark.read.parquet("s3://bucket/path").rdd.getNumPartitions I get 9 tasks/partitions. My question is: why? How does this work?

1
You haven't specified any option for your read call? – Ram Ghadiyaram
What is your default parallelism? It might be 9. I tried reading 6 small files with spark.read... and it gave me 2 partitions; it went with the defaults. – Ram Ghadiyaram
@RamGhadiyaram I didn't specify any options, just spark.read.parquet(). The default parallelism is 8: scala> spark.sparkContext.defaultParallelism res0: Int = 8. I changed it in spark-shell by setting spark.default.parallelism=2, but it still gives 9 tasks on read. – Bruno Canal
I noticed that if I increase spark.sql.files.maxPartitionBytes to 256 MB (the default is 128 MB), the number of tasks drops to 8, matching the default parallelism. But I still don't understand why there are 9 tasks when maxPartitionBytes is 128 MB. – Bruno Canal
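(For reference, a minimal spark-shell sketch of the experiment described in these comments; the s3://bucket/path placeholder is the one from the question, and the values shown in the comments are the Spark 2.4 defaults and the parallelism reported above.)

    // Inspect the settings that drive how Spark splits files into partitions
    spark.conf.get("spark.sql.files.maxPartitionBytes")   // "134217728" (128 MB default)
    spark.conf.get("spark.sql.files.openCostInBytes")     // "4194304"   (4 MB default)
    spark.sparkContext.defaultParallelism                 // 8 on this cluster

    // Raise the max partition size at runtime, then re-check the partition count
    spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
    spark.read.parquet("s3://bucket/path").rdd.getNumPartitions   // drops from 9 to 8 here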

1 Answer

0
votes

I found the answer here:

maxSplitBytes = min(defaultMaxSplitBytes,               // `spark.sql.files.maxPartitionBytes`, 128 MB by default
                    max(openCostInBytes,                // `spark.sql.files.openCostInBytes`, 4 MB by default
                        totalBytes / defaultParallelism))

where totalBytes counts each file as its size plus openCostInBytes, and the files are then packed into partitions of at most maxSplitBytes each.
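Plugging in the numbers from the question makes the 9 concrete: with the defaults, maxSplitBytes works out to 128 MB, and because of the open cost each 9 KB file effectively occupies about 4 MB of a split, so roughly 32 files fit per partition and ceil(271 / 32) = 9. The sketch below (plain Scala, not Spark's actual source, assuming the 4 MB default openCostInBytes and the defaultParallelism of 8 reported in the comments) mirrors that logic and also reproduces the 8 partitions observed after raising maxPartitionBytes to 256 MB.

    // Rough model of Spark's file-splitting math (Spark 2.4): compute maxSplitBytes,
    // then pack files greedily into partitions of at most that size, charging each
    // file its length plus openCostInBytes.
    object PartitionCountSketch {
      def estimate(numFiles: Int, fileSize: Long, maxPartitionBytes: Long,
                   openCostInBytes: Long, defaultParallelism: Int): Int = {
        // 1. Maximum split size: totalBytes counts each file as length + open cost
        val totalBytes    = numFiles * (fileSize + openCostInBytes)
        val bytesPerCore  = totalBytes / defaultParallelism
        val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

        // 2. Greedy packing: close the current partition when the next file's length
        //    would push it past maxSplitBytes; each packed file also adds the open cost
        var partitions  = 0
        var currentSize = 0L
        (1 to numFiles).foreach { _ =>
          if (currentSize + fileSize > maxSplitBytes) {
            partitions += 1
            currentSize = 0L
          }
          currentSize += fileSize + openCostInBytes
        }
        if (currentSize > 0) partitions += 1
        partitions
      }

      def main(args: Array[String]): Unit = {
        val kb = 1024L; val mb = 1024L * 1024
        println(estimate(271, 9 * kb, 128 * mb, 4 * mb, 8))  // 9, as in the question
        println(estimate(271, 9 * kb, 256 * mb, 4 * mb, 8))  // 8, as in the comments
      }
    }

In other words, the task count here is driven almost entirely by the open cost: 271 tiny files are billed as 271 × ~4 MB ≈ 1 GB of "virtual" input, which at 128 MB per partition gives 9 tasks.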