I'm trying to run
https://github.com/trisberg/springone-2015/tree/master/batch-spark
on Cloudera Hadoop 5.8 (quickstart). I followed this guide to set everything up:
http://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html
I adjusted the versions as follows:
- the Spark assembly uploaded to HDFS is spark-assembly_2.10-1.6.0-cdh5.8.0.jar;
- changed the property spring-data-hadoop.version in pom.xml to 2.4.0.RELEASE-cdh5;
- changed the property spark.version in pom.xml to 1.6.
I was able to build the project and upload the built artifact to the CDH 5.8 quickstart VM, but when I try to run it, the batch job fails.
When checking the log in Cloudera Manager, I see the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
    at java.lang.Class.getMethod0(Class.java:2813)
    at java.lang.Class.getMethod(Class.java:1663)
    at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 6 more
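One way to narrow down a NoClassDefFoundError like this is to check which jars actually contain the missing class. The following is only a minimal Python sketch: the function name jars_containing is hypothetical, and the directory to scan is an assumption (for example, a local folder holding the jars you expect to end up on the container's classpath).

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, jar_dir):
    """Return the jar files under jar_dir that contain the given class.

    Hypothetical helper for illustration; jar_dir is whatever local
    directory you want to inspect.
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(jar)
        except (zipfile.BadZipFile, OSError):
            # Skip corrupt or unreadable archives instead of aborting the scan
            continue
    return matches
```

Calling jars_containing("org.apache.hadoop.conf.Configuration", some_dir) returns the matching jar paths. An empty list for the directory you expected to provide the Hadoop classes would suggest that they are simply not on the classpath of the launched container.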
I then tried submitting the Spark job directly with the following command:
- sudo -u hdfs spark-submit --class Hashtags --master yarn --deploy-mode cluster app/spark-hashtags_2.10-0.1.0.jar hdfs://quickstart.cloudera:8020/demo/hashtags/input/tweets.dat hdfs://quickstart.cloudera:8020/demo/hashtags/output
(preparing the input and output folders by hand, to simulate what the HDFS scripts do)
and everything worked perfectly.
I then checked the Resource Manager's logs to compare the launch command produced by the Spring Batch tasklet with the one produced by spark-submit, and I found that:
- spark-submit produces the following:
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1486926591393_0015_02_000001 : LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:$LD_LIBRARY_PATH",{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=,-XX:MaxPermSize=256m,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'Hashtags',--jar,file:/home/cloudera/spring-batch-spark/app/spark-hashtags_2.10-0.1.0.jar,--arg,'/tmp/hashtags/input/tweets.dat',--arg,'/tmp/hashtags/output',--executor-memory,1024m,--executor-cores,1,--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,/stdout,2>,/stderr
- Spring Batch's tasklet generates the following:
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1486833100526_0006_01_000001 : {{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=,-XX:MaxPermSize=256m,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'Hashtags',--jar,file:/home/cloudera/spring-batch-spark/app/spark-hashtags_2.10-0.1.0.jar,--arg,'hdfs://quickstart.cloudera:8020/demo/hashtags/input/tweets.dat',--arg,'hdfs://quickstart.cloudera:8020/demo/hashtags/output',--executor-memory,1024m,--executor-cores,1,--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,/stdout,2>,/stderr
As you can see, spark-submit adds LD_LIBRARY_PATH while the Spring Batch tasklet doesn't. Apart from the input and output path arguments, that is the only difference I can see, so I suspect the problem lies there.
My knowledge of the topic is limited, so I can't work out what is going on under the hood. Has anyone run into this issue?
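The comparison of the two launch commands can also be done mechanically rather than by eye. Below is a minimal Python sketch, assuming the commands are copied from the AMLauncher log lines as comma-separated strings. The function name diff_launch_commands is hypothetical, and the two strings are abbreviated versions of the logged commands, not the full originals.

```python
def diff_launch_commands(cmd_a, cmd_b):
    """Return (only_in_a, only_in_b) for two comma-separated launch commands.

    Assumes no individual token contains a comma, which holds for the
    commands in the logs above.
    """
    tokens_a = cmd_a.split(",")
    tokens_b = cmd_b.split(",")
    only_in_a = [t for t in tokens_a if t not in tokens_b]
    only_in_b = [t for t in tokens_b if t not in tokens_a]
    return only_in_a, only_in_b


# Abbreviated copies of the two logged commands (most shared tokens omitted)
spark_submit_cmd = (
    'LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:$LD_LIBRARY_PATH",'
    "{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,"
    "--arg,'/tmp/hashtags/input/tweets.dat',--arg,'/tmp/hashtags/output'"
)
tasklet_cmd = (
    "{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,"
    "--arg,'hdfs://quickstart.cloudera:8020/demo/hashtags/input/tweets.dat',"
    "--arg,'hdfs://quickstart.cloudera:8020/demo/hashtags/output'"
)

only_in_submit, only_in_tasklet = diff_launch_commands(spark_submit_cmd, tasklet_cmd)
print("only in spark-submit:", only_in_submit)
print("only in tasklet:", only_in_tasklet)
```

On these abbreviated inputs, the tokens unique to spark-submit are the LD_LIBRARY_PATH entry and the two /tmp paths. The tokens unique to the tasklet are the two hdfs:// paths.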
Thanks to everybody. Guido