As you noted for the second way of running a Spark job, packaging a Java file that uses Spark classes and/or syntax essentially wraps your Spark job inside a Hadoop job. This has its disadvantages: mainly, your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, and there are also some growing pains around compatibility between the different frameworks' versions. So in that case, the developer must be careful about the setup the job will run on, across two different platforms, even if it seems a bit simpler for Hadoop users who have a better grasp of Java and the Map/Reduce/Driver layout than of the more opinionated nature of Spark and the somewhat-steep-learning-curve convenience of Scala.
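To make that version coupling concrete, here is a minimal sketch of what the build definition for such a packaged job might look like in sbt; the version numbers are assumptions, not a recommendation, and have to match whatever is installed on the cluster:

```scala
// build.sbt -- minimal sketch; every version below is an assumption and must match
// what is actually installed on the cluster, which is exactly the coupling described above.
scalaVersion := "2.11.12"   // must be the Scala line that Spark itself was built against

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)
// If everything is packed into one uber jar (e.g. with sbt-assembly),
// the job is tied to this exact Java/Scala/Spark combination wherever it runs.
```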
The first way of submitting a job (spark-submit) is the most "standard" one (at least judging by the majority of usage seen online, so take this with a grain of salt), running the job almost entirely within Spark (except when you read your input from or write your output to HDFS, of course). Going this way, you depend mostly on Spark alone, keeping the strange ways of Hadoop (i.e. its YARN resource management) away from your job. It can also be significantly faster in execution time, since it is the most direct approach.
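For reference, a minimal sketch of what the first way typically looks like: a small Scala driver that does not hard-code its master, submitted with spark-submit. The class name, jar name, paths and resource sizes are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Master, deploy mode and executor resources are NOT set here;
    // they are supplied on the spark-submit command line instead.
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()

    val counts = spark.read.textFile(args(0))   // e.g. an HDFS input path
      .rdd
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))              // e.g. an HDFS output path
    spark.stop()
  }
}

// Submitted with something along these lines (all values are placeholders):
//   spark-submit --master yarn --deploy-mode cluster \
//     --num-executors 4 --executor-memory 2g \
//     --class WordCount wordcount.jar hdfs:///in hdfs:///out
```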
Comments:

"`hadoop jar` will only add Hadoop-related JARs to the classpath while executing the JAR, whilst `spark-submit` will add spark-core and spark-sql as well as the Hadoop-related JARs." – philantrovert

"I don't think `hadoop jar` is correct for Spark. There's no way to pass executor parameters, for example; also you shouldn't `setMaster` manually in the code, so it wouldn't know to run in YARN." – OneCricketeer

"`hadoop jar` isn't putting Spark libraries into the classpath. Nor should your uber jar contain spark-core." – OneCricketeer
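Tying that last comment back to the build file sketched above: if the jar is meant for spark-submit, the Spark dependencies are usually marked as provided, so the uber jar does not ship its own spark-core and the cluster's Spark installation supplies those classes at run time. Same assumed versions as before:

```scala
// build.sbt (fragment) -- Spark is already on the cluster, so keep it out of the uber jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
)
```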