3
votes

I am trying to submit a Spark job to the CDH YARN cluster via the following commands.

I have tried several combinations and none of them work. I now have all the POI jars both in my local /root directory and in HDFS under /user/root/lib, and I have tried the following:

spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars /root/poi-3.12.jars, /root/poi-ooxml-3.12.jar, /root/poi-ooxml-schemas-3.12.jar

spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars file:/root/poi-3.12.jars, file:/root/poi-ooxml-3.12.jar, file:/root/poi-ooxml-schemas-3.12.jar

spark-submit --master yarn-cluster --class "ReadExcelSC" ./excel_sc.jar --jars hdfs://mynamenodeIP:8020/user/root/poi-3.12.jars,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-3.12.jar,hdfs://mynamenodeIP:8020/user/root/poi-ooxml-schemas-3.12.jar

How do I propagate the jars to all cluster nodes? None of the above works, and the job still does not pick up the class; I keep getting the same error:

java.lang.NoClassDefFoundError: org/apache/poi/ss/usermodel/WorkbookFactory

The same command works with "--master local" without specifying --jars, because I have copied my jars to /opt/cloudera/parcels/CDH/lib/spark/lib.

However, for yarn-cluster mode I need to distribute the external jars to all cluster nodes, and none of the commands above work.

Appreciate your help, thanks.

P.S. I am using CDH 5.4.2 with Spark 1.3.0.


2 Answers

3
votes

According to the help options of spark-submit:

  • --jars lists the local jars to include on the driver and executor classpaths [it only sets the classpath]

  • --files copies the files your application needs to the working directory of each executor node [it actually ships your jar to the
    working dir]

Note: This is similar to the -file option in Hadoop Streaming, which ships the mapper/reducer scripts to the worker nodes.

So try the --files option as well; a sketch follows the help output below.

$ spark-submit --help
Options:
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
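
For example, here is a minimal sketch of the submit command with these options, reusing the jar paths and class name from the question. Note that spark-submit treats everything after the application jar as arguments to your main class, so --jars (or --files) must come before ./excel_sc.jar, and the comma-separated list must not contain spaces.

# sketch only: adjust paths to your environment
spark-submit \
  --master yarn-cluster \
  --class ReadExcelSC \
  --jars /root/poi-3.12.jar,/root/poi-ooxml-3.12.jar,/root/poi-ooxml-schemas-3.12.jar \
  ./excel_sc.jar

--files takes the same comma-separated form if you also want the files shipped to each executor's working directory.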

Hope this helps.

0
votes

Have you tried the solution posted in this thread: Spark on yarn jar upload problems?

The problem there was solved by copying spark-assembly.jar into a directory on HDFS, where every node can read it, and then passing its location to spark-submit via the --conf spark.yarn.jar parameter. The commands are listed below:

hdfs dfs -copyFromLocal /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar /user/spark/spark-assembly.jar 

/var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster  --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar simplemr.jar
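
If you want to try the same approach on the CDH 5.4.2 cluster from the question, a rough sketch could look like the following. The /opt/cloudera/parcels/CDH/lib/spark/lib directory comes from the question, but the exact assembly jar file name varies by CDH build, and the HDFS target path /user/spark is only an example, so adjust both to what is actually present on your cluster.

# check the exact name of the assembly jar shipped with the CDH parcel
ls /opt/cloudera/parcels/CDH/lib/spark/lib/

# copy it once to HDFS so every node can fetch it from there
hdfs dfs -mkdir -p /user/spark
hdfs dfs -copyFromLocal /opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar /user/spark/spark-assembly.jar

# point spark-submit at the HDFS copy and keep the POI jars on --jars
spark-submit --class ReadExcelSC --master yarn-cluster \
  --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar \
  --jars /root/poi-3.12.jar,/root/poi-ooxml-3.12.jar,/root/poi-ooxml-schemas-3.12.jar \
  ./excel_sc.jar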