0 votes

I've written a script in Spark/Scala to process a large graph, and I can compile and run it in IntelliJ 14 inside the Spark source-code project (downloaded version 1.2.1). What I'm trying to do now is build the uber jar, so I have a single executable file I can upload to EC2 and run. I'm aware of the plugins that are supposed to create the fat jar for a project; however, I can't figure out how to make them do that here - both plugins just create 'uber' jars for each module rather than one main jar.

To be clear: I have tried both the Maven Assembly and Maven Shade plugins, and each time the build produces 10 module jars (named 'jar-with-dependencies' or 'uber' respectively) rather than one main jar: one for core_2.10, another for streaming_2.10, another for graphx_2.10, and so on.

I have tried altering the settings and configurations of the Maven plugins. For example, I tried adding this to the Shade plugin:

<configuration>
  <shadedArtifactAttached>false</shadedArtifactAttached>
  <artifactSet>
    <includes>
      <include>org.spark-project.spark:unused</include>
    </includes>
  </artifactSet>
</configuration>
<executions>
  <execution>
    <phase>package</phase>
    <goals>
      <goal>shade</goal>
    </goals>
  </execution>
</executions>

I've also tried the Maven Assembly plugin as an alternative:

<configuration>
  <descriptorRefs>
    <descriptorRef>jar-with-dependencies</descriptorRef>
  </descriptorRefs>
  <archive>
    <manifest>
      <mainClass>org.apache.spark.examples.graphx.PageRankGraphX</mainClass>
    </manifest>
  </archive>
</configuration>
<executions>
  <execution>
    <id>make-assembly</id>
    <phase>package</phase>
    <goals>
      <goal>single</goal>
    </goals>
  </execution>
</executions>

I would also point out that I've tried a number of variations on the plugin settings found online, but none has worked. Something is presumably wrong with the project set-up, but this isn't my own project - it's a source-code installation of Apache Spark - so I have no idea why it should be so difficult to build.

I am running the build from the command line with:

mvn package -DskipTests
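
In case it's relevant, the Spark 1.2 build docs describe the same command with Hadoop/YARN profiles added; a typical invocation looks roughly like the line below (the profile and hadoop.version flags are illustrative and depend on the target cluster, not something specific to my setup):

# Example only - choose the -P/-Dhadoop.version flags that match your cluster
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package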

I would appreciate help and suggestions.

Edit:

Further investigation shows that many of the Spark module dependencies in the final module's pom are given 'provided' scope (spark-graphx, spark-streaming, spark-mllib, etc.). However, running the jar built for this 'final' module (the examples module) fails to find classes from those modules (i.e. those dependencies) at runtime. Perhaps someone with more experience knows what this means.
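
One quick way to see this is to list the contents of the examples jar and look for the graphx classes (the jar path below is just an example of where a build might put it; the exact filename will vary with the Scala and Hadoop versions):

# Hypothetical path - substitute wherever your examples jar is produced
jar tf examples/target/scala-2.10/spark-examples-1.2.1-hadoop2.4.0.jar | grep 'org/apache/spark/graphx/'
# If this prints nothing, the 'provided' graphx classes were not bundled into
# the jar, so they have to be supplied on the classpath at runtime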

1 - Thanks Bruno. I've tried both of these plugins and unfortunately neither has worked - they create 'uber' or 'jar-with-dependencies' jars for each module, but fail to create a single jar for the whole Spark project. - user3297367

1 Answer

0 votes

You are looking for the product of mvn package in the assembly module - that is where the single uber jar is built. You do not need to add to or modify the build.
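
For a 1.2.x Maven build it normally lands under the assembly module's target directory; the exact filename depends on the Scala version and Hadoop profile you built against, so treat the path below as indicative rather than literal:

# Produced by the assembly module during `mvn package`
# (filename varies with the Scala version and Hadoop profile)
ls assembly/target/scala-2.10/spark-assembly-1.2.1-hadoop*.jar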

However, bundling an uber jar of Spark itself may not be the right way to set up and run a cluster on EC2. There is a script in the ec2/ directory for spinning up a cluster, and then you generally spark-submit your app (which includes no Spark/Hadoop classes of its own) to the cluster.
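
Roughly, that workflow looks like this (the key pair, identity file, cluster name and master hostname below are placeholders, and the main class is taken from your Assembly plugin config):

# Launch a cluster using the spark-ec2 script shipped in the source tree
./ec2/spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 launch my-cluster

# Then submit only your application jar; the Spark/Hadoop classes are already
# on the cluster, which is why they can stay 'provided' in your own build
./bin/spark-submit \
  --class org.apache.spark.examples.graphx.PageRankGraphX \
  --master spark://<master-hostname>:7077 \
  my-app.jar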