1 vote

I have the following code:

val testRDD: RDD[(String, Vector)] = sc.parallelize(testArray)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = testRDD.toDF()

df.write.parquet(path)

with the following build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.6.1"

// META-INF discarding
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
  case "reference.conf" => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
}

When I build this with sbt-assembly (I have addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")), and then I run it, I get an error:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: parquet. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:219)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
    at InductionService.Application$.main(ParquetTest.scala:65)

However, if I build this using IntelliJ IDEA (a normal build, not a fat JAR as with sbt-assembly) and debug it within the IDE, it actually works. So clearly there is something wrong with the way I build this using sbt-assembly, but I don't know how to fix it. Any ideas?

I suspect the META-INF discard code in build.sbt might be the cause, but I need that code, otherwise I can't build with sbt-assembly (it complains about duplicates...).

I am having the same problem with maven-assembly-plugin. – omrsin

2 Answers

5 votes

I had the same problem. The services folder in META-INF had some merge conflicts. I could fix this by adding a rule to the MergeStrategy:

case n if n.contains("services") => MergeStrategy.concat

This is what I have, and now it works:

val meta = """META.INF(.)*""".r
assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
  case n if n.contains("services") => MergeStrategy.concat
  case n if n.startsWith("reference.conf") => MergeStrategy.concat
  case n if n.endsWith(".conf") => MergeStrategy.concat
  case meta(_) => MergeStrategy.discard
  case x => MergeStrategy.first
}
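For background on why concat is the right strategy for these files: Spark 1.6 resolves short data source names like "parquet" through the JDK's ServiceLoader, which reads the registry files under META-INF/services from every jar on the classpath. Discarding those files is what makes the parquet source unresolvable in the fat JAR. Below is a minimal sketch of the mechanism, using java.sql.Driver as a stand-in registry because it is available without extra dependencies (the Spark case goes through META-INF/services/org.apache.spark.sql.sources.DataSourceRegister):

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

object ServiceLookupDemo extends App {
  // ServiceLoader scans every META-INF/services/java.sql.Driver file on the
  // classpath and instantiates the implementations listed there. Spark 1.6
  // locates data sources the same way, via
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  // If sbt-assembly discards those files, the lookup for "parquet" finds nothing.
  val drivers = ServiceLoader.load(classOf[java.sql.Driver]).asScala.toList
  println(s"JDBC drivers visible via ServiceLoader: ${drivers.size}")
}
```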
0 votes

The above solution is partially correct but not ideal. You do need to concatenate the files under META-INF/services and discard the rest of META-INF, but the answer above does it with case n if n.contains("services"), which is too general a condition. Any duplicated file whose path contains the word "services" (which is very common) would be concatenated, not just files within META-INF, and that includes class files. If your program or one of its dependencies contains a class like com/amazonaws/services/s3/model/AmazonS3Exception, as mine did, the duplicates would be concatenated, causing:

java.lang.ClassFormatError: Extra bytes at the end of class file com/amazonaws/services/s3/model/AmazonS3Exception

It is much better to restrict the concat clause as much as possible. The rule you want is "concatenate META-INF/services, discard everything else under META-INF". This is one way to do it:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
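To see concretely why the narrower rule matters: MergeStrategy.concat simply appends duplicate entries byte for byte. That is exactly right for the newline-separated class-name lists under META-INF/services, but fatal for binary .class files. A small sketch of the difference (the class names here are made up for illustration):

```scala
object ConcatDemo extends App {
  // What MergeStrategy.concat effectively does with two copies of the same
  // META-INF/services registry file coming from different jars:
  val fromJarA = "com.example.AImpl\n".getBytes("UTF-8")
  val fromJarB = "com.example.BImpl\n".getBytes("UTF-8")
  val merged   = fromJarA ++ fromJarB

  // Both implementations survive in one registry -- valid for a text file.
  // The same byte-appending applied to two copies of a .class file is what
  // produces "Extra bytes at the end of class file".
  println(new String(merged, "UTF-8"))
}
```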