As you noted for the second way of running a Spark job, packaging a Java file that uses Spark classes and/or syntax essentially wraps your Spark job inside a Hadoop job. This has its disadvantages: mainly, your job becomes directly dependent on the Java and Scala versions installed on your system/cluster, and there are also some growing pains around compatibility between the different frameworks' versions. So in that case, the developer must be careful about the setup the job will run on, across two different platforms, even if it seems a bit simpler for Hadoop users who have a better grasp of Java and the Map/Reduce/Driver layout than of the more opinionated nature of Spark and the somewhat-steep-learning-curve convenience of Scala.
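To make that version coupling concrete, here is a minimal sketch of what the build definition for such a packaged job might look like in sbt; the version numbers are assumptions, not a recommendation, and have to match whatever is installed on the cluster:

```scala
// build.sbt -- minimal sketch; every version below is an assumption and must match
// what is actually installed on the cluster, which is exactly the coupling described above.
scalaVersion := "2.11.12"   // must be the Scala line that Spark itself was built against

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8",
  "org.apache.spark" %% "spark-sql"  % "2.4.8"
)
// If everything is packed into one uber jar (e.g. with sbt-assembly),
// the job is tied to this exact Java/Scala/Spark combination wherever it runs.
```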
The first way of submitting a job (spark-submit) is the most "standard" one (at least judging by the majority of usage seen online, so take this with a grain of salt), running the job almost entirely within Spark (except when you read your input from or write your output to HDFS, of course). Going this way, you depend mostly on Spark alone, keeping the strange ways of Hadoop (i.e. its YARN resource management) away from your job. It can also be significantly faster in execution time, since it is the most direct approach.
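For reference, a minimal sketch of what the first way typically looks like: a small Scala driver that does not hard-code its master, submitted with spark-submit. The class name, jar name, paths and resource sizes are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Master, deploy mode and executor resources are NOT set here;
    // they are supplied on the spark-submit command line instead.
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()

    val counts = spark.read.textFile(args(0))   // e.g. an HDFS input path
      .rdd
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))              // e.g. an HDFS output path
    spark.stop()
  }
}

// Submitted with something along these lines (all values are placeholders):
//   spark-submit --master yarn --deploy-mode cluster \
//     --num-executors 4 --executor-memory 2g \
//     --class WordCount wordcount.jar hdfs:///in hdfs:///out
```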
Comments:

"`hadoop jar` will only add Hadoop-related JARs to the classpath while executing the JAR, whilst `spark-submit` will add spark-core and spark-sql as well as the Hadoop-related JARs." – philantrovert

"I don't think `hadoop jar` is correct for Spark. There's no way to pass executor parameters, for example; also you shouldn't `setMaster` manually in the code, so it wouldn't know to run in YARN." – OneCricketeer

"`hadoop jar` isn't putting Spark libraries into the classpath. Nor should your uber jar contain spark-core." – OneCricketeer
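Tying that last comment back to the build file sketched above: if the jar is meant for spark-submit, the Spark dependencies are usually marked as provided, so the uber jar does not ship its own spark-core and the cluster's Spark installation supplies those classes at run time. Same assumed versions as before:

```scala
// build.sbt (fragment) -- Spark is already on the cluster, so keep it out of the uber jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
)
```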