0
votes

I have 271 small parquet files (9 KB each) under the same directory in an S3 bucket. I'm trying to understand how Spark determines the number of tasks when it reads those files.

The cluster is AWS EMR 5.29, and my Spark conf has --num-executors 2 and --executor-cores 2.

When I run spark.read.parquet("s3://bucket/path").rdd.getNumPartitions I get 9 tasks/partitions. My question is: why? How does this work?

1
You haven't specified any option for your read call? – Ram Ghadiyaram
What is your default parallelism? It might be 9. I tried reading 6 small files with spark.read... and it gave me 2 partitions; it went with the defaults. – Ram Ghadiyaram
@RamGhadiyaram I didn't specify any options, just spark.read.parquet(). The default parallelism is 8: scala> spark.sparkContext.defaultParallelism res0: Int = 8. I changed it in spark-shell by setting spark.default.parallelism=2, but it still gives 9 tasks on read. – Bruno Canal
I noticed that if I increase spark.sql.files.maxPartitionBytes to 256 MB (the default is 128 MB), the number of tasks drops to 8, matching the default parallelism. But I still don't understand why there are 9 tasks when maxPartitionBytes is 128 MB. – Bruno Canal
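(For reference, a minimal spark-shell sketch of the experiment described in these comments; the s3://bucket/path placeholder is the one from the question, and the values shown in the comments are the Spark 2.4 defaults and the parallelism reported above.)

    // Inspect the settings that drive how Spark splits files into partitions
    spark.conf.get("spark.sql.files.maxPartitionBytes")   // "134217728" (128 MB default)
    spark.conf.get("spark.sql.files.openCostInBytes")     // "4194304"   (4 MB default)
    spark.sparkContext.defaultParallelism                 // 8 on this cluster

    // Raise the max partition size at runtime, then re-check the partition count
    spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
    spark.read.parquet("s3://bucket/path").rdd.getNumPartitions   // drops from 9 to 8 here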

1 Answer

0
votes

I found the answer here:

maxSplitBytes = min(defaultMaxSplitBytes,               // `spark.sql.files.maxPartitionBytes`, 128 MB by default
                    max(openCostInBytes,                // `spark.sql.files.openCostInBytes`, 4 MB by default
                        totalBytes / defaultParallelism))

where totalBytes counts each file as its size plus openCostInBytes, and the files are then packed into partitions of at most maxSplitBytes each.
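Plugging in the numbers from the question makes the 9 concrete: with the defaults, maxSplitBytes works out to 128 MB, and because of the open cost each 9 KB file effectively occupies about 4 MB of a split, so roughly 32 files fit per partition and ceil(271 / 32) = 9. The sketch below (plain Scala, not Spark's actual source, assuming the 4 MB default openCostInBytes and the defaultParallelism of 8 reported in the comments) mirrors that logic and also reproduces the 8 partitions observed after raising maxPartitionBytes to 256 MB.

    // Rough model of Spark's file-splitting math (Spark 2.4): compute maxSplitBytes,
    // then pack files greedily into partitions of at most that size, charging each
    // file its length plus openCostInBytes.
    object PartitionCountSketch {
      def estimate(numFiles: Int, fileSize: Long, maxPartitionBytes: Long,
                   openCostInBytes: Long, defaultParallelism: Int): Int = {
        // 1. Maximum split size: totalBytes counts each file as length + open cost
        val totalBytes    = numFiles * (fileSize + openCostInBytes)
        val bytesPerCore  = totalBytes / defaultParallelism
        val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

        // 2. Greedy packing: close the current partition when the next file's length
        //    would push it past maxSplitBytes; each packed file also adds the open cost
        var partitions  = 0
        var currentSize = 0L
        (1 to numFiles).foreach { _ =>
          if (currentSize + fileSize > maxSplitBytes) {
            partitions += 1
            currentSize = 0L
          }
          currentSize += fileSize + openCostInBytes
        }
        if (currentSize > 0) partitions += 1
        partitions
      }

      def main(args: Array[String]): Unit = {
        val kb = 1024L; val mb = 1024L * 1024
        println(estimate(271, 9 * kb, 128 * mb, 4 * mb, 8))  // 9, as in the question
        println(estimate(271, 9 * kb, 256 * mb, 4 * mb, 8))  // 8, as in the comments
      }
    }

In other words, the task count here is driven almost entirely by the open cost: 271 tiny files are billed as 271 × ~4 MB ≈ 1 GB of "virtual" input, which at 128 MB per partition gives 9 tasks.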