
The code below causes Spark to become unresponsive:

import org.apache.spark.{SparkConf, SparkContext}

object Application {

  System.setProperty("hadoop.home.dir", "H:\\winutils")

  val sparkConf = new SparkConf().setAppName("GroupBy Test").setMaster("local[1]")
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]) {

    val text_file = sc.textFile("h:\\data\\details.txt")

    val counts = text_file
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(counts)
  }
}

I'm setting hadoop.home.dir in order to avoid the error described here: Failed to locate the winutils binary in the hadoop binary path

This is what my build.sbt file looks like:

lazy val root = (project in file(".")).
  settings(
    name := "hello",
    version := "1.0",
    scalaVersion := "2.11.0"
  )


libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "1.6.0"
)

Should this Spark Scala code be compilable and runnable with the build.sbt above?

I think the code itself is fine; it was taken verbatim from http://spark.apache.org/examples.html, but I am not sure whether the Hadoop winutils path is required.

Update: "The solution was to use fork := true in the main build.sbt" Here is the reference: Spark: ClassNotFoundException when running hello world example in scala 2.11

This runs just fine as it is on my end (the only modifications I made are different paths for the text file and winutils). It starts up, prints ShuffledRDD[4] at reduceByKey at Application.scala:18, and shuts down. The only thing I see immediately is that there is no action at the end of the transformations, i.e. the data never gets computed and returned to the driver; the code only builds an RDD through a few transformations. But that shouldn't cause the application to hang, especially not in single-threaded local mode (streaming needs at least 2 threads, but you're not using that). – alextsc
@alextsc Are you running on Windows? I'm using Windows 10. – blue-sky
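
As the first comment notes, the snippet only defines transformations and never triggers an action, so nothing is actually computed. A minimal sketch of forcing the computation, assuming the counts RDD from the question (the output path is only illustrative):

// pull the word counts back to the driver and print them;
// collect is fine here because a word-count result is small
counts.collect().foreach { case (word, n) => println(s"$word: $n") }

// or keep the result distributed and write it to disk instead:
// counts.saveAsTextFile("h:\\data\\word_counts")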

1 Answer


This is the content of my build.sbt. Note that if your internet connection is slow, the first build might take a while, since the dependencies have to be downloaded.

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-mllib" % "1.6.1",
  "org.apache.spark" %% "spark-sql" % "1.6.1",
  "org.slf4j" % "slf4j-api" % "1.7.12"
)


run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

In main I added the following; the exact path depends on where you placed the winutils folder.

System.setProperty("hadoop.home.dir", "c:\\winutil")
