3
votes

I am trying to assemble a Scala 2.11 / Spark 2.0 application that uses hortonworks-spark/shc to access HBase.

The dependency set looks simple:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "com.hortonworks" % "shc-core" % "1.0.1-2.0-s_2.11"
)

The problem comes when I try to assemble the application into a fat JAR: there are a lot of transitive dependencies with different versions, so the assembly plugin throws deduplicate errors. One example:

deduplicate: different file contents found in the following:
[error] /home/search/.ivy2/cache/org.mortbay.jetty/jsp-2.1/jars/jsp-2.1-6.1.14.jar:org/apache/jasper/xmlparser/XMLString.class
[error] /home/search/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.23.jar:org/apache/jasper/xmlparser/XMLString.class

Also, I don't know if it is right to include dependencies like org.apache.hbase:hbase-server:1.1.2 in the jar.

So, basically, the question is: does anyone know the right way to assemble a Scala Spark application using this library and sbt, and can you provide an example? (And maybe add it to the documentation of hortonworks-spark/shc.)

Note: hortonworks-spark/shc is not included in spark-packages, so I cannot use the --packages option unless I keep a local copy of the jars. I am using EMR, so I don't have a preconfigured cluster where I could copy the jar without adding more complexity to the deployment.

3
Could you describe in the question which problem comes up when you assemble the fat jar? - Nikita
@ipoteka I'm sorry. Added one error example. - angelcervera

3 Answers

0
votes

You should specify the mergeStrategy mentioned in the README. You can also provide the common libraries on the Spark nodes instead of including them in the fat JAR each time. A rough way to do that is to upload them to each worker and add them to the classpath.
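As a rough sketch of the "provided" part in build.sbt (versions taken from the question; I've also assumed you need spark-sql, adjust to your cluster):

// rough sketch: cluster-provided libraries are excluded from the fat jar,
// only shc-core (and anything else the cluster does not ship) gets bundled
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided",
  "com.hortonworks"  %  "shc-core"   % "1.0.1-2.0-s_2.11"
)

The jars you upload to the workers can then be put on the classpath, for example via spark.executor.extraClassPath and spark.driver.extraClassPath.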

0
votes

As you can see, the Jasper XML parser classes are pulled in by two different jars, hence the duplicate. You can exclude the reference from one of them as follows.

For example: libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.98.12-hadoop2" excludeAll ExclusionRule(organization = "org.mortbay.jetty")

And regarding adding all dependencies into the fat jar: at least for Spark applications, that is the suggested way of handling jar assembly. The most common jars (like Spark, HBase, etc.) can be made part of the classpath on the edge node, or whichever node you are running from. Any other specific jars can be part of the uber jar. You can mark the Spark/HBase jars as provided.

At the core, the following should help you do the basic operations.

libraryDependencies += "org.apache.hbase" % "hbase-client" % "0.98.12-hadoop2" // % "provided"
libraryDependencies += "org.apache.hbase" % "hbase-common" % "0.98.12-hadoop2"  //% "provided"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.98.12-hadoop2" excludeAll ExclusionRule(organization = "org.mortbay.jetty")
0
votes

You have to provide a mergeStrategy in build.sbt.

It looks something like this:

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case "reference.conf" => MergeStrategy.concat
    case x => MergeStrategy.first
}

In this example I have specified MergeStrategy.first. There are a few other options, like MergeStrategy.last and MergeStrategy.concat.

MergeStrategy.first means that, for a given conflicting path, it will keep the file from the first jar it sees when creating the uber jar.

In some cases that may not work; if so, please try MergeStrategy.last as well.
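If a blanket MergeStrategy.first feels too broad, you can also match only the conflicting paths and fall back to the default strategy for everything else. A sketch using the same older assemblyMergeStrategy in assembly syntax as above (the jasper path is just the one from the error in the question):

assemblyMergeStrategy in assembly := {
    case PathList("org", "apache", "jasper", xs @ _*) => MergeStrategy.first
    case "reference.conf" => MergeStrategy.concat
    case x =>
        // fall back to sbt-assembly's default strategy for everything else
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
}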