
Let's say we have a Spark job running in cluster mode, where the cluster manager is YARN.

In cluster mode:

  1. A user submits a pre-compiled JAR or Python script to the cluster
    manager. The cluster manager then tells a specific NodeManager to
    launch the Application Master.
  2. The Spark driver then runs inside the Application Master. The driver
    converts the user's code, containing transformations and actions, into
    a logical plan (the DAG), which is then converted into a physical
    execution plan.
  3. The Application Master then communicates with the cluster manager and
    negotiates resources, requesting things such as the number of
    containers and preferred executor locations.
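To make the setup concrete, a cluster-mode submission to YARN typically looks like this (the class name, JAR path, and resource values are placeholders; the flags themselves are standard spark-submit options):

```shell
# Submit a pre-compiled JAR in cluster mode with YARN as the cluster manager.
# The driver will run inside the Application Master on the cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  myapp.jar
```

The --num-executors, --executor-memory, and --executor-cores values are the resource requests the Application Master later negotiates with YARN.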

At this point, does the cluster manager allocate the YARN containers, or does the Application Master allocate them? And does the cluster manager create the Spark executors as well, or does the Application Master do this?


1 Answer

  1. A Spark application submitted to YARN translates into a YARN
    application.
  2. The client submits a new application/job request to the
    ResourceManager (a YARN component).
  3. The ResourceManager (RM) accepts the job request and allocates a
    container in which to start the Application Master for the given
    application/job. The Application Master can be thought of as a
    specialized container that manages/monitors the application's tasks.
  4. The Application Master sends a request to the ResourceManager,
    asking for the resources required to run the application/job.
  5. The ResourceManager responds with a list of containers, along with
    the slave nodes they can be spawned on.
  6. The Application Master starts the containers (in Spark's case, the
    executors) on each of the specified slave nodes.
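The division of labor in the last three steps (the ResourceManager allocates containers; the Application Master launches executors inside them) can be sketched as a toy Python model. The class and method names below are illustrative only, not YARN's real API:

```python
# Toy model of YARN's split of responsibilities (not the real API):
# the ResourceManager *allocates* containers on slave nodes, and the
# ApplicationMaster *launches* an executor inside each granted container.

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes  # hypothetical slave-node hostnames

    def allocate(self, num_containers):
        # The RM decides placement: it returns container -> node assignments.
        return [{"id": i, "node": self.nodes[i % len(self.nodes)]}
                for i in range(num_containers)]

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run(self, num_executors):
        # The AM asks the RM for resources...
        containers = self.rm.allocate(num_executors)
        # ...then starts an executor process in each container it was granted.
        return [f"executor-{c['id']} on {c['node']}" for c in containers]

rm = ResourceManager(nodes=["node1", "node2"])
am = ApplicationMaster(rm)
print(am.run(3))
# -> ['executor-0 on node1', 'executor-1 on node2', 'executor-2 on node1']
```

So the short answer to the question: the ResourceManager allocates the containers, and the Application Master launches the Spark executors inside them.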

The first fact to understand is that each Spark executor runs as a YARN container. Because the set of executors for an application is fixed, and so are the resources allotted to each executor, a Spark application holds its resources for its entire duration. This is in contrast with a MapReduce application, which returns resources at the end of each task and is allotted them again at the start of the next task.
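For completeness, this hold-resources-for-the-whole-run behavior is Spark's default; Spark also offers dynamic allocation, which lets idle executors be released back to YARN. A hedged config sketch (the spark.dynamicAllocation.* and spark.shuffle.service.enabled settings are real Spark configuration keys; the values are illustrative):

```shell
# Same cluster-mode submission, but allowing the executor count to grow
# and shrink between the configured bounds instead of staying fixed.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  --conf spark.shuffle.service.enabled=true \
  myapp.jar
```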