How does Spark-submit in cluster deploy mode manage the application Jars

Question

In the Book Spark in Action, i am reading this:

“If you’re submitting your application in cluster-deploy mode using the spark-submit script, the JAR file you specify needs to be available on the worker (at the location you specified) that will be executing the application. Because there’s no way to say in advance which worker will execute your driver, you should put your application’s JAR file on all the workers if you intend to use cluster-deploy mode, or you can put your application’s JAR file on HDFS and use the HDFS URL as the JAR filename.”

But in the official documentation I see this:

1 - If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

2-If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2). To control the application’s configuration or execution environment, see Spark Configuration.

What am I missing here ? How does it work ? Do i need to deploy my assembly jar all over the cluster (expect for the master node) ?

maasg maasg · Accepted Answer · 2017-08-10T12:40:20

The official documentation is correct (as we would expect).

TL;DR:
There is no need to copy application files or dependencies across the cluster to submit a Spark job with spark-submit.

spark-submit takes care of delivering the application jar to the executors. Even more, the jar files specified using the --jars option are also served by the file server on the driver program to all executors, so we don't need to copy any dependencies to the executors, either. Spark takes care of that for you.

Further details are available on the Advanced Dependency Management page

How does Spark-submit in cluster deploy mode manage the application Jars

4 Answers