Plenty of questions have already been asked about the number of Spark tasks and how it relates to the number of partitions, but somehow I cannot understand the following case.
I have a Hive table (an HDFS folder) which contains 160 compressed Parquet files. The files are mostly well balanced: the smallest is 7.5 MB, the largest 49.2 MB. In the HDFS browser I can see that each file fits within a single (non-full) HDFS block (128 MB).
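For reference, this is roughly how I checked the file sizes from the Spark shell; the warehouse path below is a placeholder, not my real location:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val tableDir = new Path("/user/hive/warehouse/myhivetable")  // placeholder path

    // Print the size of every file in the table directory, in MB
    fs.listStatus(tableDir)
      .filter(_.isFile)
      .foreach(s => println(f"${s.getPath.getName}%-40s ${s.getLen / (1024.0 * 1024.0)}%.1f MB"))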
The cluster has the following properties: 10 machines (1 master and 9 workers), each with 6 cores (12 virtual cores). I am using YARN. Moreover:
spark.executor.cores = 6
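To give the full picture, this is roughly how the session is configured (paraphrased; the app name is a placeholder, and in practice the setting is passed at submit time through YARN):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("myApp")                     // placeholder name
      .config("spark.executor.cores", "6")  // each executor may run up to 6 tasks in parallel
      .enableHiveSupport()
      .getOrCreate()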
Now, I create the following dataframe:
val myDF = spark.sql("SELECT * FROM myHiveTable WHERE myCol='someValue'")
Even before the job is triggered, it is possible to know in advance that:
myDF.rdd.partitions.size
returns 60.
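In case it is relevant, here is how I can inspect the settings that, as far as I understand, influence how files are grouped into read partitions (these are standard Spark SQL options; I am only reading them here, not claiming I changed them):

    println(myDF.rdd.getNumPartitions)                            // the same 60 as above
    println(spark.conf.get("spark.sql.files.maxPartitionBytes"))  // default 128 MB
    println(spark.conf.get("spark.sql.files.openCostInBytes"))    // default 4 MB
    println(spark.sparkContext.defaultParallelism)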
To trigger the job one needs an action, so I write myDF to HDFS. The job indeed runs with 42 executors and 60 tasks.
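Concretely, the action looks roughly like this (the output path is a placeholder):

    myDF.write
      .mode("overwrite")
      .parquet("hdfs:///tmp/myDF_output")  // placeholder output path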
My questions:
1. If I started with 160 files (which I assumed would mean 160 partitions), how come I ended up with only 60?
2. If I have 60 tasks and 10 machines, then I would optimally need only 10 executors (somewhere I read that each executor can run as many tasks in parallel as it has cores, which in my case is 6). I know this would happen only if the dataset were perfectly balanced across the data nodes, but 42 executors seems far away from 10 (see the quick arithmetic sketch after these questions). Or is my reasoning wrong?
3. How can Spark know in advance, even before running the query, that it will result in 60 partitions?
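To make the arithmetic behind question 2 explicit, this is the rough estimate I have in mind:

    // Back-of-the-envelope numbers from above
    val tasks            = 60
    val coresPerExecutor = 6   // spark.executor.cores
    val workerMachines   = 9   // 10 machines minus the master

    // If every executor can really run 6 tasks in parallel, then in the ideal,
    // perfectly data-local case I would expect on the order of:
    val executorsNeeded = math.ceil(tasks.toDouble / coresPerExecutor).toInt  // = 10
    // ...which is why the observed 42 executors surprises me.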
Thank you!