I have 271 small parquet files (9KB each) under the same directory in an S3 bucket, and I'm trying to understand how Spark determines the number of tasks when it reads those files.
The cluster is AWS EMR 5.29 and my Spark conf has --num-executors 2 and --executor-cores 2.
When I run spark.read.parquet("s3://bucket/path").rdd.getNumPartitions I get 9 partitions, i.e. 9 tasks. My question is: why 9? How does this work?
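For completeness, these are the settings I assume are involved (I haven't changed any of them; the defaults noted in the comments are from the Spark 2.4 documentation, since EMR 5.29 runs Spark 2.4.4):

    // Settings I assume drive partitioning for a file-based read (Spark 2.4.4 on EMR 5.29).
    spark.conf.get("spark.sql.files.maxPartitionBytes") // default: 134217728 (128mb)
    spark.conf.get("spark.sql.files.openCostInBytes")   // default: 4194304 (4mb)
    spark.sparkContext.defaultParallelism               // 8 on this cluster (see comments below)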
From the comments:

The default parallelism is 8:

    scala> spark.sparkContext.defaultParallelism
    res0: Int = 8
I changed it by setting spark.default.parallelism=2 in spark-shell, but the read still uses 9 tasks. – Bruno Canal

If I set spark.sql.files.maxPartitionBytes to 256mb (the default is 128mb), the tasks decrease to 8, matching the default parallelism. But I still don't understand why there are 9 tasks when maxPartitionBytes is 128mb. – Bruno Canal
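To show what I've tried so far: below is my rough attempt to reproduce the sizing logic I found while reading the Spark 2.4 source (DataSourceScanExec / FilePartition). The formula, the greedy packing, and the 4MB spark.sql.files.openCostInBytes default are my own reading of that code, so please correct me if this is not how it actually works:

    // Sketch of how I understand Spark 2.4 sizes read partitions (my assumption, not verified):
    //   maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, totalBytes / defaultParallelism))
    // where every file is padded with openCostInBytes, and files are then packed
    // greedily into partitions until adding the next file would exceed maxSplitBytes.
    object PartitionEstimate extends App {
      val numFiles           = 271
      val fileSize           = 9L * 1024         // ~9KB per file
      val openCostInBytes    = 4L * 1024 * 1024  // spark.sql.files.openCostInBytes default (4MB)
      val defaultParallelism = 8                 // value reported by spark.sparkContext.defaultParallelism

      def estimatePartitions(maxPartitionBytes: Long): Int = {
        val totalBytes    = numFiles * (fileSize + openCostInBytes)
        val bytesPerCore  = totalBytes / defaultParallelism
        val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

        var partitions  = 0
        var currentSize = 0L
        (1 to numFiles).foreach { _ =>
          // start a new partition when the next file's length would push us past maxSplitBytes
          if (currentSize + fileSize > maxSplitBytes) {
            partitions += 1
            currentSize = 0L
          }
          // each packed file accounts for its length plus the open cost
          currentSize += fileSize + openCostInBytes
        }
        if (currentSize > 0) partitions += 1  // close the last, partially filled partition
        partitions
      }

      println(estimatePartitions(128L * 1024 * 1024)) // prints 9 -> what I see with the default
      println(estimatePartitions(256L * 1024 * 1024)) // prints 8 -> what I see with 256mb
    }

This reproduces both numbers I'm seeing (9 with the 128mb default, 8 with 256mb), but since it's only my reconstruction, I'd like an explanation of how Spark really computes this.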