2 votes

I've created a standalone Spark (2.1.1) cluster on my local machines with 9 cores / 80 GB per machine (a total of 27 cores / 240 GB RAM).

I've got a sample Spark job that sums all the numbers from 1 to x. This is the code:

package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder
          .master("spark://192.168.1.2:7077")
          .config("spark.driver.maxResultSize" ,"3g")
          .appName("ExampleApp")
          .getOrCreate()
      val sc = spark.sparkContext
      // List.range is end-exclusive, so this sums 1 to 999
      val rdd = sc.parallelize(List.range(1, 1000))
      val sum = rdd.reduce((a, b) => a + b)
      println(sum)
      done
    }

    def done = {
      println("\n\n")
      println("-------- DONE --------")
    }
}

When running the above code I get results after a few seconds, so I cranked up the code to sum all the numbers from 1 to 1B (1,000,000,000), and then I get a "GC overhead limit exceeded" error.
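Presumably the change is just the upper bound of the parallelize range; a sketch of the assumed modified line:

// assumed modified line (not shown above): List.range(1, 1000000000) materializes
// a List of one billion boxed Ints on the driver before Spark distributes anything
val rdd = sc.parallelize(List.range(1, 1000000000))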

I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help.

Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10G

I'm not a developer and have no knowledge of Scala, but I would like to find a way to run this code without GC issues.

Per @philantrovert's request, I'm adding my spark-submit command:

/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar

In addition, my spark/conf files are as follows:

  • the slaves file contains the 3 IP addresses of my nodes (including the master)
  • spark-defaults.conf contains:
    • spark.master spark://192.168.1.2:7077
    • spark.driver.memory 10g
  • spark-env.sh contains:
    • SPARK_LOCAL_DIRS= shared folder among all nodes
    • SPARK_EXECUTOR_MEMORY=10G
    • SPARK_DRIVER_MEMORY=10G
    • SPARK_WORKER_CORES=1
    • SPARK_WORKER_MEMORY=10G
    • SPARK_WORKER_INSTANCES=8
    • SPARK_WORKER_DIR= shared folder among all nodes
    • SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
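Written out as the actual file, the spark-env.sh above would look roughly like this (the /mnt/spark-share/... paths are placeholders for the shared folders):

# spark-env.sh (sketch; shared-folder paths are placeholders)
SPARK_LOCAL_DIRS=/mnt/spark-share/local
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR=/mnt/spark-share/work
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"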

Thanks

Can you add your spark-submit command to the question? – philantrovert
@philantrovert added the spark-submit + my spark configuration. – Y. Eliash
Try adding --conf "spark.driver.maxResultSize=3G" to your spark-submit instead of setting it in your program. I haven't worked with Spark Standalone clusters, but I think the driver starts before it can execute the conf.set(..) in your program. I might be wrong. – philantrovert
Have you tried creating your RDD using val rdd = spark.range(1000000000L).rdd? I think creating a Scala list with 1 billion entries is the problem here... – Raphael Roth
@philantrovert - the conf didn't help and the job failed after 21 minutes. – Y. Eliash
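For reference, philantrovert's suggested change applied to the spark-submit command above would look roughly like this (a sketch; same class, master and jar as before):

/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
--conf "spark.driver.maxResultSize=3G" \
/mnt/spark-share/example_2.11-1.0.jar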

1 Answer

3 votes

I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (4 GB). There is a more efficient way to programmatically create a Dataset/RDD:

val rdd = spark.range(1000000000L).rdd
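A sketch of the full sum done this way (assuming the goal is still 1 to 1,000,000,000; the two-argument spark.range is end-exclusive, so the upper bound is 1000000001L, and the values are Longs, which also avoids overflowing Int):

// spark.range creates a distributed Dataset of Longs, so no billion-element
// List is ever materialized on the driver
import spark.implicits._
val rdd = spark.range(1L, 1000000001L).as[Long].rdd
val sum = rdd.reduce(_ + _)   // 500000000500000000
println(sum)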