
I'm trying to run

https://github.com/trisberg/springone-2015/tree/master/batch-spark

on Cloudera Hadoop 5.8 (QuickStart). I followed this guide to set everything up:

http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
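
For reference, the guide drives the Spark job from the batch job through a SparkYarnTasklet. My configuration is essentially the one from the guide and looks roughly like this (a sketch; the package name, assembly location and paths here are assumptions based on the guide and the sample, not copied verbatim from my project):

    import java.io.File;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    // tasklet class named in the Spring for Apache Hadoop guide; package assumed from the batch module
    import org.springframework.data.hadoop.batch.spark.SparkYarnTasklet;

    @Configuration
    public class SparkTaskletConfig {

        // Hadoop Configuration bean created by the spring-hadoop setup (XML namespace or annotation config)
        @Autowired
        private org.apache.hadoop.conf.Configuration hadoopConfiguration;

        @Bean
        SparkYarnTasklet sparkTasklet() throws Exception {
            SparkYarnTasklet tasklet = new SparkYarnTasklet();
            // Spark assembly previously uploaded to HDFS; the ApplicationMaster classpath is built from it
            tasklet.setSparkAssemblyJar("hdfs:///app/spark/spark-assembly_2.10-1.6.0-cdh5.8.0.jar");
            tasklet.setHadoopConfiguration(hadoopConfiguration);
            tasklet.setAppClass("Hashtags");
            File appJar = new File(System.getProperty("user.dir") + "/app/spark-hashtags_2.10-0.1.0.jar");
            tasklet.setAppJar("file://" + appJar.getAbsolutePath());
            tasklet.setExecutorMemory("1G");
            tasklet.setNumExecutors(1);
            tasklet.setArguments(new String[] {
                    "hdfs://quickstart.cloudera:8020/demo/hashtags/input/tweets.dat",
                    "hdfs://quickstart.cloudera:8020/demo/hashtags/output" });
            return tasklet;
        }
    }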

I fixed the versions as follows (the resulting pom.xml properties are sketched below):

  • the Spark assembly to upload to HDFS is spark-assembly_2.10-1.6.0-cdh5.8.0.jar;
  • set the spring-data-hadoop.version property in pom.xml to 2.4.0.RELEASE-cdh5;
  • set the spark.version property in pom.xml to 1.6.
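
The resulting pom.xml properties look roughly like this (a sketch; I've written the Spark version as 1.6.0, the release CDH 5.8 ships):

    <properties>
        <!-- Spring for Apache Hadoop build for CDH 5 -->
        <spring-data-hadoop.version>2.4.0.RELEASE-cdh5</spring-data-hadoop.version>
        <!-- Spark 1.6 line, matching CDH 5.8 -->
        <spark.version>1.6.0</spark.version>
    </properties>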

I was able to build the project and upload the built artifact to the CDH 5.8 QuickStart VM, but when I try to run it, the batch job fails.

When checking the logs in Cloudera Manager I see the following error:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
        at java.lang.Class.getMethod0(Class.java:2813)
        at java.lang.Class.getMethod(Class.java:1663)
        at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 6 more

I also tried to submit the Spark job directly with the following command:

  • sudo -u hdfs spark-submit --class Hashtags --master yarn --deploy-mode cluster app/spark-hashtags_2.10-0.1.0.jar hdfs://quickstart.cloudera:8020/demo/hashtags/input/tweets.dat hdfs://quickstart.cloudera:8020/demo/hashtags/output

(preparing the input and output folders on HDFS by hand, simulating the project's helper scripts; a sketch of those commands follows below)

and everything worked perfectly.
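
For completeness, the manual preparation was along these lines (a sketch; the paths match the arguments passed to spark-submit above):

    sudo -u hdfs hdfs dfs -mkdir -p /demo/hashtags/input
    sudo -u hdfs hdfs dfs -put tweets.dat /demo/hashtags/input/
    # the job writes the output directory itself, so it must not exist before the run
    sudo -u hdfs hdfs dfs -rm -r -f /demo/hashtags/output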

I checked the Resource Manager's logs to compare the container launch command produced by the Spring Batch tasklet with the one produced by spark-submit, and I found the following:

  • spark-submit produces the following:

org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1486926591393_0015_02_000001 : LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:$LD_LIBRARY_PATH",{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=,-XX:MaxPermSize=256m,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'Hashtags',--jar,file:/home/cloudera/spring-batch-spark/app/spark-hashtags_2.10-0.1.0.jar,--arg,'/tmp/hashtags/input/tweets.dat',--arg,'/tmp/hashtags/output',--executor-memory,1024m,--executor-cores,1,--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,/stdout,2>,/stderr

  • Spring Batch's tasklet generates the following:

org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1486833100526_0006_01_000001 : {{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=,-XX:MaxPermSize=256m,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'Hashtags',--jar,file:/home/cloudera/spring-batch-spark/app/spark-hashtags_2.10-0.1.0.jar,--arg,'hdfs://quickstart.cloudera:8020/demo/hashtags/input/tweets.dat',--arg,'hdfs://quickstart.cloudera:8020/demo/hashtags/output',--executor-memory,1024m,--executor-cores,1,--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,/stdout,2>,/stderr

As you can see, spark-submit sets LD_LIBRARY_PATH while the Spring Batch tasklet doesn't; since that seems to be the only difference, I suspect the problem lies there.

Due to my limited knowledge of the topic I can't work out what is going on under the hood. Has anyone run into this issue?

Thanks to everybody. Guido


1 Answer


Thanks for the detailed comparison. I don't think LD_LIBRARY_PATH would cause this particular error; I wonder if the difference in the --arg values has any impact. For the spark-submit example you used /tmp, versus hdfs://quickstart.cloudera:8020/demo/ for the spring-hadoop one. Could you try spark-submit with the hdfs://quickstart.cloudera:8020/demo/ prefix?

UPDATE: It looks like the Cloudera-provided assembly jar 'spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar' is missing the Hadoop configuration classes and cannot be used with the "spring-data-hadoop-spark" features. You have to use the complete assembly jar provided by the Spark project in their downloads. I tested with 'spark-assembly-1.6.2-hadoop2.6.0.jar' and it worked fine on the Cloudera QuickStart VM 5.8.
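
For example, something along these lines should work (a sketch; the HDFS location and the property you point at it depend on how your project is set up):

    # grab the full assembly from the Apache Spark 1.6.2 pre-built download
    # (it ships in the lib/ directory of spark-1.6.2-bin-hadoop2.6)
    sudo -u hdfs hdfs dfs -mkdir -p /app/spark
    sudo -u hdfs hdfs dfs -put spark-assembly-1.6.2-hadoop2.6.0.jar /app/spark/
    # then point the tasklet's sparkAssemblyJar (or the corresponding application property)
    # at hdfs:///app/spark/spark-assembly-1.6.2-hadoop2.6.0.jar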