
I want to build a 'fat' jar of my code. I mostly understand how to do this, but all the examples I have assume the dependency jar is not local, and I am not sure how to include another JAR that I built myself (and that the Scala code uses) into my assembled jar. What folder does this JAR need to reside in?

Normally when I run my current code as a test using spark-shell it looks like this:

spark-shell --jars magellan_2.11-1.0.6-SNAPSHOT.jar -i st_magellan_abby2.scala 

(the jar file is in the same directory as the .scala file)

So now I want to write a build.sbt that does the same thing and includes that SNAPSHOT.jar file.

name := "PSGApp"
version := "1.0"
scalaVersion := "2.11.8"

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"

// 'provided' means don't include it; it's already there (on the cluster?)

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided"
  // add magellan here somehow?
)

So where would I put the jar in the SBT project folder structure so it gets picked up when I run sbt assembly? Is that the main/resources folder, which the reference manual says is where 'files to include in the main jar' go?

What would I put in the libraryDependencies here so it knows to add that specific jar and not go out to the internet to get it?

One last thing: I was also doing some imports in my test code that don't seem to fly now that I have put this code in an object with a def main attached to it.

I had things like:

import sqlContext.implicits._, which sat right in the code, just above where it was about to be used, like so:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._
import org.apache.spark.sql.functions.udf
import magellan.Point  // Point comes from the magellan library

val distance = udf { (a: Point, b: Point) =>
  a.withinCircle(b, 0.001f)  // current radius set to 0.001
}

I am not sure: can I just keep these imports inside def main, or do I have to move them elsewhere somehow? (Still learning Scala and wrangling with the scoping, I guess.)


1 Answer


One way is to build your fat jar using the assembly plugin (https://github.com/sbt/sbt-assembly) locally, and then run publishLocal to store the resulting jar in your local Ivy2 cache.
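The plugin itself lives in project/plugins.sbt; the version number below is an assumption, so check the plugin's README for the current release:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")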

This will make it available for inclusion in your other project, based on the build.sbt settings in the published project, e.g.:

name := "My Project"
organization := "org.me"
version := "0.1-SNAPSHOT"

It will then be locally available as "org.me" %% "my-project" % "0.1-SNAPSHOT". SBT searches the local cache before trying to download from an external repo.
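In the consuming project's build.sbt that dependency is then just, for example:

libraryDependencies += "org.me" %% "my-project" % "0.1-SNAPSHOT"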

However, this is considered bad practice, because only the final project should ever be a fat jar. You should never include one as a dependency (many headaches).

There is no reason to make the magellan project a fat jar if the library is included in PSGApp. Just publishLocal without assembly.

Another way is to make the projects depend on each other as code, not as a published library.

lazy val projMagellan = RootProject("../magellan")
lazy val projPSGApp = project.in(file(".")).dependsOn(projMagellan)

This makes compilation in projPSGApp trigger compilation in projMagellan.

It depends on your use case though.

Just don't get into a situation where you have to manage your .jar files manually.

The other question:

import sqlContext.implicits._ should always be included in the scope where DataFrame operations are required, so you shouldn't move that import up near the other ones in the header.
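For example, a minimal sketch of the structure (object name and context setup taken from the question; keeping the imports inside def main is fine, since they only need to be in scope before the DataFrame code that uses them):

object PSGApp {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("PSGApp")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // scoped to main, right where they are needed
    import sqlContext.implicits._
    import org.apache.spark.sql.functions.udf

    // ... DataFrame / udf code from the question goes here
  }
}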


Update

Based on the discussion in the comments, my advice would be:

  • Get the magellan repo

git clone git@github.com:harsha2010/magellan.git

  • Create a branch to work on, e.g.:

git checkout -b new-stuff

  • Change the code you want
  • Then update the version number, e.g.:

version := "1.0.7-SNAPSHOT"

  • Publish locally

sbt publishLocal

You'll see something like (after a while):

[info] published ivy to /Users/tomlous/.ivy2/local/harsha2010/magellan_2.11/1.0.7-SNAPSHOT/ivys/ivy.xml

  • Go to your other project
  • Change build.sbt to include "harsha2010" %% "magellan" % "1.0.7-SNAPSHOT" in your libraryDependencies, as shown below

Now you have a good (temporary) reference to your library.

Your PSGApp should then be built as a fat jar assembly to pass to Spark:

sbt clean assembly

This will pull in the custom-built jar.
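As an aside, if assembly fails with deduplicate errors because several dependencies ship the same META-INF files (common with Spark fat jars), a merge strategy in build.sbt is the usual fix; this is a generic sketch for sbt-assembly 0.14.x, so adjust the cases to whatever actually conflicts:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop duplicate manifests/signatures
  case x => MergeStrategy.first  // otherwise keep the first copy found
}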

If the change in the magellan project is useful for the rest of the world, you should push your changes and create a pull request, so that in the future you can just include the latest build of this library.