Both spark and yarn are distributed framework , but their roles are different:
Yarn is a resource management framework, for each application, it has following roles:
ApplicationMaster: resource management of a single application, including ask for/release resource from Yarn for the application and monitor.
Attempt: an attempt is just a normal process which does part of the whole job of the application. For example , a mapreduce job which consists of multiple mappers and reducers , each mapper and reducer is an Attempt.
A common process of summiting a application to yarn is:
The client submit the application request to yarn. In the
request, Yarn should know the ApplicationMaster class; For
SparkApplication, it is
org.apache.spark.deploy.yarn.ApplicationMaster
,for MapReduce job ,
it is org.apache.hadoop.mapreduce.v2.app.MRAppMaster
.
Yarn allocate some resource for the ApplicationMaster process and
start the ApplicationMaster process in one of the cluster nodes;
After ApplicationMaster starts, ApplicationMaster will request resource from Yarn for this Application and start up worker;
For Spark, the distributed computing framework, a computing job is divided into many small tasks and each Executor will be responsible for each task, the Driver will collect the result of all Executor tasks and get a global result. A spark application has only one driver with multiple executors.
So, then ,the problem comes when Spark is using Yarn as a resource management tool in a cluster:
In Yarn Cluster Mode, Spark client will submit spark application to
yarn, both Spark Driver and Spark Executor are under the supervision
of yarn. In yarn's perspective, Spark Driver and Spark Executor have
no difference, but normal java processes, namely an application
worker process. So, when the client process is gone , e.g. the client
process is terminated or killed, the Spark Application on yarn is
still running.
In yarn client mode, only the Spark Executor are under the
supervision of yarn. The Yarn ApplicationMaster will request resource
for just spark executor. The driver program is running in the client
process which have nothing to do with yarn, just a process submitting
application to yarn.So ,when the client leave, e.g. the client
process exits, the Driver is down and the computing terminated.