1 vote

Currently I am using Spark 2.0.0 in cluster mode (Standalone cluster) with the following cluster config:

Workers: 4
Cores in use: 32 Total, 32 Used
Memory in use: 54.7 GB Total, 42.0 GB Used

I have 4 slaves (workers) and 1 master machine. There are three main parts to a Spark cluster: the Master, the Driver, and the Workers (ref).

Now my problem is that the driver starts up on one of the worker nodes, which prevents me from using the worker nodes at their full capacity (RAM-wise). For example, if I run my Spark job with 2 GB of memory for the driver, I am left with only ~13 GB on each machine for executor memory (assuming each machine has 15 GB of total RAM). I think there are two ways to fix this:

1) Run the driver on the master machine; this way I can specify the full 15 GB of RAM as executor memory.

2) Specify the driver machine explicitly (one of the worker nodes), and assign memory to both the driver and the executor on that machine accordingly. For the rest of the worker nodes I can specify the maximum executor memory.

How do I achieve option 1 or 2? Is it even possible?
Any pointers are appreciated.
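For context, the submission presumably looks something like the following today (a rough sketch only; the master host, application class, jar path, and memory sizes are placeholders inferred from the description above):

    # Cluster deploy mode on the standalone master: the driver is launched
    # on one of the workers, so its 2g comes out of that worker's ~15 GB.
    ./bin/spark-submit \
      --master spark://<master-host>:7077 \
      --deploy-mode cluster \
      --driver-memory 2g \
      --executor-memory 13g \
      --class com.example.MyApp \
      /path/to/my-app.jar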

What is your question? – Yuval Itzchakov
@YuvalItzchakov I edited the question a bit. Basically, the question is how to specify the driver host explicitly so that the worker nodes' RAM can be utilized to the max. – Arry

1 Answer

3 votes

To run the driver on the master machine, run spark-submit from the master itself and specify --deploy-mode client (see Launching Applications with spark-submit).
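For example (a minimal sketch; the master host, class name, jar path, and memory sizes are placeholders, assuming roughly 15 GB of RAM per worker):

    # Run this from the master machine. In client mode the driver runs in
    # the spark-submit process on the master, so the workers' RAM is left
    # for executors.
    ./bin/spark-submit \
      --master spark://<master-host>:7077 \
      --deploy-mode client \
      --driver-memory 2g \
      --executor-memory 14g \
      --class com.example.MyApp \
      /path/to/my-app.jar

Note that the memory each worker advertises (SPARK_WORKER_MEMORY in spark-env.sh) caps what you can request per node, so it is worth leaving some headroom for the OS and the worker daemon rather than requesting the full 15 GB.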

It is not possible to specify which worker the driver will run on when using --deploy-mode cluster with the standalone cluster manager. However, you can run the driver on a worker and still achieve maximum cluster utilisation if you use a cluster manager such as YARN or Mesos.
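For instance, in YARN cluster mode the driver runs inside the YARN application master container, and the resource manager places it on whichever node has spare capacity instead of carving a fixed slice out of every worker (again a sketch; the class, jar path, and sizes are placeholders):

    # YARN cluster mode: the driver runs in the application master, and
    # YARN schedules it against available capacity on the cluster.
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 2g \
      --executor-memory 13g \
      --num-executors 4 \
      --class com.example.MyApp \
      /path/to/my-app.jar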