I have the following code:
// assuming Vector here is the MLlib vector type, since spark-mllib is a dependency
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val testRDD: RDD[(String, Vector)] = sc.parallelize(testArray)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = testRDD.toDF()
df.write.parquet(path)
with the following build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.6.1"
// META-INF discarding
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case "reference.conf" => MergeStrategy.concat
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
  }
}
When I build this with sbt-assembly (I have addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")) and then run the resulting fat JAR, I get this error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: parquet. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:219)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at InductionService.Application$.main(ParquetTest.scala:65)
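As far as I can tell from the Spark source, lookupDataSource resolves the short name "parquet" through java.util.ServiceLoader, i.e. through the registration files under META-INF/services on the classpath. As a sanity check (an untested sketch, nothing more), I was planning to run something like this from inside the assembled JAR to see which data sources are actually registered:

import java.util.ServiceLoader
import org.apache.spark.sql.sources.DataSourceRegister
import scala.collection.JavaConverters._

// prints the short names of all data sources registered via
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
val loader = ServiceLoader.load(classOf[DataSourceRegister], getClass.getClassLoader)
loader.asScala.foreach(ds => println(ds.shortName()))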
However, if I build this with IntelliJ IDEA (a normal build, not a fat JAR as with sbt-assembly) and debug it inside the IDE, it works. So there is clearly something wrong with the way I build this with sbt-assembly, but I don't know how to fix it. Any ideas?
I suspect that the META-INF discard code in build.sbt might be the cause, but I need that code, otherwise I can't build with sbt-assembly (it complains about duplicate files).
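One thing I have been meaning to try (just a sketch, not verified) is to keep the service registration files by concatenating everything under META-INF/services before the general META-INF discard, leaving the rest of the strategy as it is:

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    // keep (and merge) service registrations such as
    // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case "reference.conf" => MergeStrategy.concat
    case x => MergeStrategy.first
  }
}

Would that be the right direction, or is there a better way to handle META-INF with sbt-assembly?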