3
votes

I am trying to assemble a Scala 2.11 / Spark 2.0 application that uses hortonworks-spark/shc to access HBase.

The dependency set looks simple:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "com.hortonworks" % "shc-core" % "1.0.1-2.0-s_2.11"
)

The problem comes when I try to assemble the application into a fat JAR: there are a lot of transitive dependencies with different versions, so the assembly plugin throws deduplicate errors. One example:

deduplicate: different file contents found in the following:
[error] /home/search/.ivy2/cache/org.mortbay.jetty/jsp-2.1/jars/jsp-2.1-6.1.14.jar:org/apache/jasper/xmlparser/XMLString.class
[error] /home/search/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.23.jar:org/apache/jasper/xmlparser/XMLString.class

Also, I don't know if it is right to include dependencies like org.apache.hbase:hbase-server:1.1.2 in the jar.

So, basically, the question is: does anyone know the right way to assemble a Scala Spark application using this library and sbt, and can you provide an example? (And maybe add it to the documentation of hortonworks-spark/shc.)

Note: hortonworks-spark/shc is not included in spark-packages, so I cannot use the --packages option unless I keep a local copy of the jars. I am using EMR, so I don't have a preconfigured cluster where I could copy the jar without adding more complexity to the deployment.

3
Could you describe in the question which problem comes up when you assemble the fat jar? - Nikita
@ipoteka I'm sorry. Added one error example. - angelcervera

3 Answers

0
votes

You should specify the mergeStrategy mentioned in the README. You can also provide the common libraries on the Spark nodes instead of including them in the fat JAR each time. A rough way to do that is to upload them to each worker and add them to the classpath.
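As a rough sketch of the "provided" part in build.sbt (versions taken from the question; I've also assumed you need spark-sql, adjust to your cluster):

// rough sketch: cluster-provided libraries are excluded from the fat jar,
// only shc-core (and anything else the cluster does not ship) gets bundled
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided",
  "com.hortonworks"  %  "shc-core"   % "1.0.1-2.0-s_2.11"
)

The jars you upload to the workers can then be put on the classpath, for example via spark.executor.extraClassPath and spark.driver.extraClassPath.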

0
votes

As you can see, the Jasper XML parser classes are pulled in by two different jars, hence the duplicate. You can exclude the reference from one of them as follows.

For example: libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.98.12-hadoop2" excludeAll ExclusionRule(organization = "org.mortbay.jetty")

And regarding adding all dependencies into the fat jar: at least for Spark applications, that is the suggested way of handling jar assembly. The most common jars (like Spark, HBase, etc.) can be made part of the classpath on the edge node, or whichever node you are running from. Any other specific jars can be part of the uber jar. You can mark the Spark/HBase jars as provided.

At the core, the following should help you do the basic operations.

libraryDependencies += "org.apache.hbase" % "hbase-client" % "0.98.12-hadoop2" // % "provided"
libraryDependencies += "org.apache.hbase" % "hbase-common" % "0.98.12-hadoop2"  //% "provided"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.98.12-hadoop2" excludeAll ExclusionRule(organization = "org.mortbay.jetty")
0
votes

You have to provide a mergeStrategy in build.sbt.

It looks something like this:

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case "reference.conf" => MergeStrategy.concat
    case x => MergeStrategy.first
}

In this example I have specified MergeStrategy.first. There are a few other options, like MergeStrategy.last and MergeStrategy.concat.

MergeStrategy.first means that, for a given conflicting path, it will keep the file from the first jar it sees when creating the uber jar.

In some cases that may not work; if so, please try MergeStrategy.last as well.
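If a blanket MergeStrategy.first feels too broad, you can also match only the conflicting paths and fall back to the default strategy for everything else. A sketch using the same older assemblyMergeStrategy in assembly syntax as above (the jasper path is just the one from the error in the question):

assemblyMergeStrategy in assembly := {
    case PathList("org", "apache", "jasper", xs @ _*) => MergeStrategy.first
    case "reference.conf" => MergeStrategy.concat
    case x =>
        // fall back to sbt-assembly's default strategy for everything else
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
}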