There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is:
- The Spark Driver node (
sparkDriverCount
) - The number of worker nodes available to a Spark cluster (
numWorkerNodes
) - The number of Spark executors (
numExecutors
) - The DataFrame being operated on by all workers/executors, concurrently (
dataFrame
) - The number of rows in the
dataFrame
(numDFRows
) - The number of partitions on the
dataFrame
(numPartitions
) - And finally, the number of CPU cores available on each worker nodes (
numCpuCoresPerWorker
)
I believe that all Spark clusters have one-and-only-one Spark Driver, and then 0+ worker nodes. If I'm wrong about that, please begin by correcting me! Assuming I'm more or less correct about that, let's lock in a few variables here. Let's say we have a Spark cluster with 1 Driver and 4 Worker nodes, and each Worker Node has 4 CPU cores on it (so a total of 16 CPU cores). So the "given" here is:
sparkDriverCount = 1
numWorkerNodes = 4
numCpuCores = numWorkerNodes * numCpuCoresPerWorker = 4 * 4 = 16
Given that as the setup, I'm wondering how to determine a few things. Specifically:
- What is the relationship between
numWorkerNodes
andnumExecutors
? Is there some known/generally-accepted ratio of workers to executors? Is there a way to determinenumExecutors
givennumWorkerNodes
(or any other inputs)? - Is there a known/generally-accepted/optimal ratio of
numDFRows
tonumPartitions
? How does one calculate the 'optimal' number of partitions based on the size of thedataFrame
? - I've heard from other engineers that a general 'rule of thumb' is:
numPartitions = numWorkerNodes * numCpuCoresPerWorker
, any truth to that? In other words, it prescribes that one should have 1 partition per CPU core.