I find official documentation quite thorough and covering all your questions. However, one might find it hard to digest from the first time.
Let us put some definitions and rough analogues before we delve into details. application
is what creates SparkContext sc
and may be referred to as something you deploy with spark-submit. job
is an action in spark definition of transformation and action meaning anything like count, collect etc.
There are two main and in some sense separate topics: Scheduling Across application
s and Scheduling Within application
. The former relates more to Resource Managers including Spark Standalone FIFO only mode and also concept of static and dynamic allocation.
The later, Scheduling Within Spark application
is the matter of your question, as I understood from your comment. Let me try to describe what happens there at some level of abstraction.
Suppose, you submitted your application
and you have two job
s
sc.textFile("..").count() //job1
sc.textFile("..").collect() //job2
If this code happens to be executed in the same thread there is no much interesting happening here, job2 and all its tasks get resources only after job1 is done.
Now say you have the following
thread1 { job1 }
thread2 { job2 }
This is getting interesting. By default, within your application
scheduler will use FIFO to allocate resources to all the tasks of whichever job
happens to appear to scheduler as first. Tasks for the other job
will get resources only when there are spare cores and no more pending tasks from more "prioritized" first job
.
Now suppose you set spark.scheduler.mode=FAIR
for your application
. From now on each job
has a notion of pool
it belongs to. If you do nothing then for every job pool
label is "default". To set the label for your job
you can do the following
sc.setLocalProperty("spark.scheduler.pool", "pool1").textFile("").count() // job1
sc.setLocalProperty("spark.scheduler.pool", "pool2").textFile("").collect() // job2
One important note here is that setLocalProperty is effective per thread and also all spawned threads. What it means for us? Well if you are within the same thread it means nothing as job
s are executed one after another.
However, once you have the following
thread1 { job1 } // pool1
thread2 { job2 } // pool2
job1 and job2 become unrelated in the sense of resource allocation. In general, properly configuring each pool in fairscheduler file with minShare > 0 you can be sure that job
s from different pools will have resources to proceed.
However, you can go even further. By default, within each pool
job
s are queued up in a FIFO manner and this situation is basically the same as in the scenario when we have had FIFO mode and job
s from different threads. To change that you you need to change the pool
in the xml file to have <schedulingMode>FAIR</schedulingMode>
.
Given all that, if you just set spark.scheduler.mode=FAIR
and let all the job
s fall into the same "default" pool, this is roughly the same as if you would use default spark.scheduler.mode=FIFO
and have your job
s be launched in different threads. If you still just want single "default" fair pool just change config for "default" pool in xml file to reflect that.
To leverage the mechanism of pool
s you need to define the concept of user
which is the same as setting "spark.scheduler.pool" from a proper thread to a proper value. For example, if your application
listens to JMS, then a message processor may set the pool label for each message processing job
depending on its content.
Eventually, not sure if the number of words is less than in the official doc, but hopefully it helps is some way :)