The two relevant parameters seem to me to be `spark.default.parallelism` and `spark.cores.max`. `spark.default.parallelism` sets the number of partitions of the in-memory data, and `spark.cores.max` sets the number of available CPU cores. However, in traditional parallel computing, I would explicitly launch some number of threads.
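To make the model I have in mind concrete, here is a minimal Python sketch (not Spark code; the constants and function are made up for illustration) of a fixed pool of worker threads processing far more partitions than there are threads:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 4          # analogous to spark.cores.max (illustrative value)
NUM_PARTITIONS = 1000  # analogous to spark.default.parallelism

def process_partition(i):
    # stand-in for the work done on one partition
    return i * i

# Only NUM_CORES threads ever exist; the 1000 partition-tasks are
# queued and scheduled onto them as threads become free.
with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    results = list(pool.map(process_partition, range(NUM_PARTITIONS)))

print(len(results))
```

In this model the thread count is fixed up front and independent of the task count, which is the behavior I am asking whether Spark follows.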
Will Spark launch one thread per partition, regardless of the number of available cores? If there are 1 million partitions, will Spark limit the number of threads to some reasonable multiple of the number of available cores?
How is the number of threads determined?