6
votes

We are trying to build a fat jar containing one small Scala source file and a large number of dependencies (a simple MapReduce example using Spark and Cassandra):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._
import org.apache.spark.SparkConf

object VMProcessProject {

    def main(args: Array[String]): Unit = {
        // Point the connector at the local Cassandra node and add the connector jar to the executors' classpath
        val conf = new SparkConf()
            .set("spark.cassandra.connection.host", "127.0.0.1")
            .set("spark.executor.extraClassPath", "C:\\Users\\SNCUser\\dataquest\\ScalaProjects\\lib\\spark-cassandra-connector-assembly-1.3.0-M2-SNAPSHOT.jar")
        println("got config")
        val sc = new SparkContext("spark://US-L15-0027:7077", "test", conf)
        println("Got spark context")

        val rdd = sc.cassandraTable("test_ks", "test_col")

        println("Got RDDs")

        println(rdd.count())

        val newRDD = rdd.map(x => 1)
        val count1 = newRDD.reduce((x, y) => x + y)

    }
}

We do not have a build.sbt file; instead, we put the jars into a lib folder and the source files in the src/main/scala directory, and run the program with sbt run. Our assembly.sbt file looks as follows:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

When we run sbt assembly we get the following error message:

...
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent...

We're not sure how to change the JVM settings to increase the memory, since we are using sbt assembly to make the jar. Also, if there is something egregiously wrong with how we are writing the code or building our project, pointing it out would help us a lot; there have been so many headaches trying to set up a basic Spark program!

4 Answers

10
votes

sbt is essentially a Java process. You can tune sbt's runtime heap size to address OutOfMemoryError issues.

For 0.13.x, the default memory options sbt uses are

-Xms1024m -Xmx1024m -XX:ReservedCodeCacheSize=128m -XX:MaxPermSize=256m

And you can enlarge the heap size by doing something like

sbt -J-Xms2048m -J-Xmx2048m assembly

7
votes

I was including Spark as an unmanaged dependency (putting the jar file in the lib folder), which used a lot of memory because it is a huge jar.

First, I made a build.sbt file that declares Spark as a managed dependency with the "provided" scope instead, so it is no longer packed into the fat jar.
Second, I created the environment variable JAVA_OPTS with the value -Xms256m -Xmx4g, which sets the initial heap size to 256 megabytes while allowing the heap to grow to a maximum of 4 gigabytes. These two changes combined allowed me to create a jar file with sbt assembly.

More info on provided dependencies:

https://github.com/sbt/sbt-assembly
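As a rough illustration of the "provided" approach described above, a build.sbt might look something like the sketch below. The project name and every version number here are assumptions chosen for illustration, not values from the original project; they would need to match the Spark and Scala versions of the actual cluster. The Spark jar would also have to be removed from the lib folder, otherwise it is still picked up as an unmanaged dependency.

```scala
// build.sbt: illustrative sketch only.
// The project name and all version numbers below are assumptions.
name := "vm-process-project"

version := "0.1.0"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // "provided": available at compile time, but left out of the fat jar
  // because the Spark cluster already supplies these classes at runtime.
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  // No "provided" scope here, so the connector is bundled into the fat jar.
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0"
)
```

With Spark marked as provided, sbt assembly has far fewer classes to merge, which is likely why the memory pressure dropped.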

3
votes

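This works for me. The -mem option sets sbt's memory in megabytes, and the set command skips running the tests during assembly: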

sbt -mem 2000 "set test in assembly := {}" assembly

2
votes

I ran into this issue before. In my environment, setting JAVA_OPTS did not work. I used the commands below instead, and they worked:

  1. set SBT_OPTS="-Xmx4G"
  2. sbt assembly

After that, the out-of-memory error no longer occurred.
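Since different answers report different variables taking effect, it may help to check what heap limit the JVM actually received. The following is a minimal sketch, assuming it is run through sbt without forking (the sbt 0.13 default), so that it executes inside sbt's own JVM. The object name HeapCheck is hypothetical and not part of the original project.

```scala
// HeapCheck.scala: hypothetical helper, not part of the original project.
object HeapCheck {
  // Maximum amount of heap the current JVM will attempt to use, in megabytes.
  def maxHeapMb: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max heap: $maxHeapMb MB")
}
```

Running it with something like sbt "runMain HeapCheck" after setting SBT_OPTS should print a value close to the configured -Xmx if the setting was picked up.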