18
votes

On a 3-node Spark/Hadoop cluster, which scheduler (cluster manager) will work most efficiently? Currently I am using the Standalone manager, but for each Spark job I have to explicitly specify all resource parameters (e.g. cores, memory), which I want to avoid. I have tried YARN as well, but it runs 10x slower than the Standalone manager.

Would Mesos be helpful?

Cluster Details: Spark 1.2.1 and Hadoop 2.7.1

2
[Disclaimer: not a YARN expert] I think it strongly depends on what future workloads you plan to add to your cluster. Mesos is a generic scheduler, while YARN is more tailored to Hadoop workloads. – rukletsov
Have a look at this related SE question: stackoverflow.com/questions/28664834/… – Ravindra babu

2 Answers

30
votes

Apache Spark runs in the following cluster modes:

  • Local
  • Standalone
  • YARN
  • Mesos
  • Kubernetes
  • Nomad

Local mode runs a Spark application on a single machine, directly on the operating system, without a cluster manager. It is useful for developing and testing Spark applications.

Standalone, YARN, Mesos and Kubernetes are distributed modes. In a distributed environment, computing resources have to be shared among applications, so a good resource management system (resource scheduler) is needed to allocate them efficiently.
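As a rough illustration of how these modes look from the application side, the sketch below maps each mode to the master URL format passed to `spark-submit --master`. The helper name `master_url` and any host names used with it are made up for this example. The default ports (7077 for a standalone master, 5050 for a Mesos master) are the usual defaults and may differ on your cluster. Nomad is left out because it is not an officially supported cluster manager.

```python
# Hypothetical helper: builds the --master value for each cluster mode.
# Scheme prefixes are the ones Spark documents; hosts/ports are placeholders.
_SCHEMES = {
    "standalone": ("spark://", 7077),        # usual standalone master port
    "mesos": ("mesos://", 5050),             # usual Mesos master port
    "kubernetes": ("k8s://https://", None),  # API server port varies, no default
}


def master_url(mode, host=None, port=None, threads="*"):
    """Return the master URL string for the given cluster mode."""
    mode = mode.lower()
    if mode == "local":
        # local[*] uses as many worker threads as there are cores.
        return "local[%s]" % threads
    if mode == "yarn":
        # The cluster location is read from the Hadoop configuration
        # (HADOOP_CONF_DIR / YARN_CONF_DIR), not from the URL.
        # Older 1.x releases used "yarn-client" / "yarn-cluster" instead.
        return "yarn"
    if mode not in _SCHEMES:
        raise ValueError("unsupported cluster mode: %s" % mode)
    scheme, default_port = _SCHEMES[mode]
    port = port if port is not None else default_port
    if host is None or port is None:
        raise ValueError("%s mode needs a host and a port" % mode)
    return "%s%s:%d" % (scheme, host, port)
```

For example, `master_url("standalone", "node1")` gives a URL that could be passed to `spark-submit --master`, assuming a standalone master is running on a host named `node1`.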

Standalone is good for small Spark clusters but not for bigger ones, because it has the overhead of running Spark daemons (master + slave) on the cluster nodes, and these daemons require dedicated resources. Standalone is therefore not recommended for bigger production clusters. It also supports only Spark applications; it is not a general-purpose cluster manager. In an enterprise context, where a variety of workloads has to run, the Spark standalone cluster manager is not a good choice.

With YARN and Mesos, Spark runs as just another application and there is no Spark daemon overhead, so either can be used for better performance and scalability. Both are general-purpose distributed resource managers that support a variety of workloads (MapReduce, Spark, Flink, Storm, etc.) with container orchestration. They are good for running large-scale enterprise production clusters.

Between YARN and Mesos, YARN is designed specifically for Hadoop workloads, whereas Mesos is designed for all kinds of workloads. YARN is an application-level scheduler and Mesos is an OS-level scheduler. It is better to use YARN if you already have a running Hadoop cluster (Apache/CDH/HDP). For a brand-new project, it is better to use Mesos (Apache, Mesosphere). It is also possible to run both colocated using a project called Apache Myriad.

Kubernetes - An open-source system for automating the deployment, scaling, and management of containerized applications, so it is used to run Spark applications in containers. Most cloud vendors, such as Google, Microsoft and Amazon, offer Kubernetes as a managed service. We can also run Spark applications in containers on an on-premises K8s cluster. Here the containers are Docker or cgroups/Linux containers.

Nomad - Another open-source system for running Spark applications. It is not officially supported by the Spark project as a cluster manager.

Of the modes above, Apache Mesos has the better resource management capabilities.

Please see this link; it contains a detailed explanation from experts on YARN vs. Mesos: http://www.quora.com/How-does-YARN-compare-to-Mesos

8
votes

On a 3-node cluster I'd just go with the standalone manager; the overhead of the additional processes would not pay off.
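If the standalone manager is kept, one way to avoid repeating resource parameters for every job is to put cluster-wide defaults in `conf/spark-defaults.conf`, which `spark-submit` reads. The sketch below only renders such a file as text. The helper name `render_spark_defaults` is made up for this example. The property names are real Spark settings, but the values shown in the usage note are placeholders and not tuned recommendations.

```python
def render_spark_defaults(props):
    """Render a dict of Spark properties as spark-defaults.conf text.

    Each line is "<key><whitespace><value>", which is the format the
    file uses. Keys are padded so the values line up in a column.
    """
    width = max(len(key) for key in props)
    lines = ["%s  %s" % (key.ljust(width), value) for key, value in props.items()]
    return "\n".join(lines) + "\n"
```

A call such as `render_spark_defaults({"spark.master": "spark://node1:7077", "spark.executor.memory": "4g", "spark.cores.max": "6"})` returns text that could be saved as `conf/spark-defaults.conf`, assuming a master on a host named `node1`. Values passed explicitly on the `spark-submit` command line still override these defaults.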