1 vote

I have been struggling for days with the installation of Spark on a cluster.

Because the cluster uses Hadoop 2.2 and I want to use PySpark on YARN, I had to build Spark using Maven. The output of this process is a .jar file: spark-assembly-1.2.0-hadoop2.2.0.jar (I am not familiar with Java). This .jar file will not run if I try to execute it with Java on any of my nodes ("could not find or load main class").
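
(For context, the Maven command I ran was roughly the one given in the Spark 1.2 build documentation for Hadoop 2.2; the exact flags may have differed:)

mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package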

The installation instructions I can find involve running a .sh file, which was not an output of my Maven build.

What am I missing here? I cannot find answers in the documentation.


1 Answer

0 votes

You do not need to build Spark with Maven in order to use PySpark. You can use the submission scripts that ship with the pre-built Spark package.
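
A minimal sketch with the pre-built package looks roughly like this (the archive name and config path are illustrative; adjust them to your download and cluster):

# extract the pre-built Spark 1.2.0 package (file name is illustrative)
tar -xzf spark-1.2.0-bin-hadoop2.tgz
cd spark-1.2.0-bin-hadoop2
# point Spark at the cluster's YARN/Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
# interactive PySpark shell on YARN
./bin/pyspark --master yarn-client
# or submit a Python application to YARN
./bin/spark-submit --master yarn-client examples/src/main/python/pi.py 10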

edit:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
export JAVA_HOME=your_java_home

./make-distribution.sh -Pyarn -Phadoop-2.2

The resulting distribution will be in the dist directory.
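
You can then launch PySpark on YARN directly from that directory, for example (assuming HADOOP_CONF_DIR points at your cluster configuration):

cd dist
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/pyspark --master yarn-client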