Apache Spark 1.2.1 standalone cluster giving java heap space error

Question

I need to know how to estimate how much heap space (memory) is needed to process x MB of data (say, x = 600 MB) on a Spark standalone cluster.

Scenario:

I have a standalone cluster with 14 GB of memory and 8 cores. I want to process 600 MB of data: read it from files and write it to Cassandra.

For this task I set the following on the SparkConf:

    .set("spark.cassandra.output.throughput_mb_per_sec", "800")
    .set("spark.storage.memoryFraction", "0.3")

and pass these options when submitting the job:

    --executor-memory=5g --total-executor-cores 6 --driver-memory 6g

In spite of the above configuration, I am getting a Java heap space error while writing data to Cassandra.

Below is the java code:

    public static void main(String[] args) throws Exception {
        String fileName = args[0];

        Long now = new Date().getTime();

        SparkConf conf = new SparkConf(true)
                .setAppName("JavaSparkSQL_" + now)
                .set("spark.cassandra.connection.host", "192.168.1.65")
                .set("spark.cassandra.connection.native.port", "9042")
                .set("spark.cassandra.connection.rpc.port", "9160")
                .set("spark.cassandra.output.throughput_mb_per_sec", "800")
                .set("spark.storage.memoryFraction", "0.3");

        JavaSparkContext ctx = new JavaSparkContext(conf);

        // Read the input file from HDFS with at least 12 partitions.
        JavaRDD<String> input = ctx.textFile(
                "hdfs://abc.xyz.net:9000/figmd/resources/" + fileName, 12);

        // Parse each partition's JSON into PlanOfCare objects and filter them.
        JavaRDD<PlanOfCare> result = input.mapPartitions(new ParseJson())
                .filter(new PickInputData());

        // Print the record count, then pull every record to the driver and print it.
        System.out.print("Count --> " + result.count());
        System.out.println(StringUtils.join(result.collect(), ","));

        // Write the records to the Cassandra table ks.pt_planofcarelarge.
        javaFunctions(result)
                .writerBuilder("ks", "pt_planofcarelarge", mapToRow(PlanOfCare.class))
                .saveToCassandra();
    }

What configuration am I supposed to use? Am I missing anything? Thanks in advance.

Thanks for responding. The heap space error comes while writing data to Cassandra. — Abhinandan Satpute
You need to share some code. Maybe you are doing a collect over your RDD that results in a heap space error, or maybe it is one of the many other things that can blow up your heap. Your question is not solvable like this. — eliasah
Right, I am doing a collect as well. I will share the code right now. — Abhinandan Satpute
In order of magnitude, what is the size and count of your JavaRDD result? — eliasah

eliasah · Accepted Answer · 2015-04-29T11:50:36

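The JavaRDD collect method returns a list that contains all of the elements in the RDD, and that list is built in the memory of the driver program.

So in your case, it creates a list with 340,000 elements on the driver, which results in the Java heap space error. You may want to collect only a small sample of your data (for example with take(n) or sample(...)), or save the output directly to disk (for example with saveAsTextFile(...)) instead of collecting it.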
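As a rough way to reason about how much driver heap the collect-and-join step can need, here is a minimal back-of-the-envelope sketch in plain Java. It is not a measurement. It assumes the collected objects occupy about as much heap as the raw text would as Java Strings (2 bytes per character for mostly ASCII input on Java 7/8), and that joining them holds a builder buffer plus a final String copy at the same time. The class and method names are hypothetical, and the real footprint of PlanOfCare objects may differ considerably.

```java
// Hypothetical helper: a rough estimate only, under the assumptions stated in the comments.
class CollectHeapEstimate {

    // Assumption: Java 7/8 Strings store UTF-16 data, so mostly-ASCII text
    // takes about 2 bytes of heap per byte on disk (object overhead ignored).
    static final long BYTES_PER_CHAR = 2L;

    /** Rough heap bytes needed to hold inputBytes of mostly-ASCII text as Java Strings. */
    public static long asJavaStrings(long inputBytes) {
        return inputBytes * BYTES_PER_CHAR;
    }

    /**
     * Rough peak heap for collect() followed by a join into one String:
     * the collected elements, the builder's buffer, and the final String copy
     * are all assumed to be alive at the same moment.
     */
    public static long collectAndJoinPeak(long inputBytes) {
        long collected = asJavaStrings(inputBytes); // list returned by collect()
        long builder = asJavaStrings(inputBytes);   // buffer filled while joining
        long joined = asJavaStrings(inputBytes);    // String produced by toString()
        return collected + builder + joined;
    }

    public static void main(String[] args) {
        long inputBytes = 600L * 1024 * 1024; // the 600 MB from the question
        long peakMb = collectAndJoinPeak(inputBytes) / (1024 * 1024);
        System.out.println("Estimated peak driver heap: " + peakMb + " MB");
    }
}
```

Under these assumptions, 600 MB of input comes out at roughly 3600 MB on the driver for this one statement, before counting per-object overhead or the extra copies made while a builder's buffer grows. The point is only that collecting and printing the whole RDD costs several times the on-disk size, whereas result.take(n) with a small n keeps the driver's share negligible.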

For more information about JavaRDD, you can always refer to the official documentation.
