
I am getting more confused as I keep reading online resources about Spark architecture and scheduling. One resource says: "The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage." On the other hand: "Spark maps the number of tasks on a particular executor to the number of cores allocated to it." So the first resource says that if I have 1000 partitions, I will have 1000 tasks no matter what my machine is. In the second case, if I have a 4-core machine and 1000 partitions, then what? Will I have 4 tasks? Then how is the data processed?

Another confusion: "each worker can process one task at a time" versus "executors can run multiple tasks over their lifetime, both in parallel and sequentially." So are tasks sequential or parallel?


1 Answer

  • The number of tasks is given by the number of partitions of an RDD/DataFrame.
  • The number of tasks which an executor can process in parallel is given by its number of cores, unless spark.task.cpus is configured to something other than 1 (the default value). See the sketch below.
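As a minimal illustration (the partition count and session name are just examples, not anything specific to your setup), you can check both facts from a spark-shell session:

    // Assumes an existing SparkSession named `spark`, as in spark-shell
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 1000)

    // Each stage over this RDD is split into one task per partition
    println(rdd.getNumPartitions)                   // 1000

    // CPUs reserved per task; with the default of 1,
    // tasks running in parallel per executor = cores per executor
    println(spark.conf.get("spark.task.cpus", "1")) // "1" unless overridden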

So think of tasks as independent chunks of work that have to be processed. They can certainly run in parallel.

So if you have 1000 partitions and 5 executors with 4 cores each, 20 tasks will generally run in parallel, and the 1000 tasks are processed in roughly 50 such waves, one batch after another.
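A rough sketch of that setup (the resource numbers are purely illustrative, and how you request executors depends on your cluster manager):

    import org.apache.spark.sql.SparkSession

    // 5 executors with 4 cores each => 20 task slots across the cluster,
    // so 1000 tasks are scheduled in about 1000 / 20 = 50 waves
    val spark = SparkSession.builder()
      .appName("parallelism-sketch")
      .config("spark.executor.instances", "5")
      .config("spark.executor.cores", "4")
      .getOrCreate()

    val slots = 5 * 4          // tasks that can run concurrently
    val waves = 1000 / slots   // scheduling waves for 1000 partitions
    println(s"$slots tasks in parallel, about $waves waves")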