0 votes

I have a question about Apache Spark (YARN cluster).

In the following code I create 10 partitions, but only 3 containers do any work in the YARN cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

    // Read the 58 GB file with a hint of 10 partitions and map each word to a (word, 1) pair
    val sparktest = sc.textFile("/spark_test/58GB.dat", 10)
    val test = sparktest.flatMap(line => line.split(" ")).map(word => (word, 1))

In a Spark YARN cluster, how does the work assigned to containers depend on the number of RDD partitions?

*My English is limited, so please bear with my awkward phrasing.


1 Answer

2 votes

A Spark executor running on YARN is simply a JVM process, and this process is sometimes referred to as a YARN container. When you say you use 3 containers, it means you have 3 JVMs running on the YARN cluster nodes, i.e. the nodes running the YARN NodeManager.

When you start Spark on a YARN cluster, you can specify the number of executors you want with --num-executors and the amount of memory dedicated to each of them with --executor-memory.
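
For reference, a minimal sketch of the programmatic equivalent through SparkConf; the property keys spark.executor.instances and spark.executor.memory correspond to the --num-executors and --executor-memory flags, and the values below are illustrative, not taken from the question:

    import org.apache.spark.{SparkConf, SparkContext}

    // Request 3 executors with 4 GB of heap each when running on YARN
    val conf = new SparkConf()
      .setAppName("Spark Count")
      .set("spark.executor.instances", "3")  // same effect as --num-executors 3
      .set("spark.executor.memory", "4g")    // same effect as --executor-memory 4g

    val sc = new SparkContext(conf)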

When you read a file into an RDD and specify that it should have 10 partitions, it means that during the execution of your code the source file would be read into 10 partitions. Each partition is a chunk of data stored in the memory of a single JVM, and the node that stores it is chosen based on the locality of the source data.

In your specific case with textFile and an explicit number of partitions, this number is pushed down to the Hadoop TextInputFormat class, which reads the file in 10 splits based on the file size (each split would be approximately 5.8 GB).
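
If you want to verify how many partitions were actually created, here is a minimal sketch using the sparktest RDD from the question (partitions.length is a standard RDD member; note that the number passed to textFile is a minimum, so the InputFormat may produce more splits):

    // Print the number of partitions Spark actually created for the file
    println(s"partitions = ${sparktest.partitions.length}")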

So in fact, after reading the source file into 10 partitions (I assume that you would execute cache() and an action like count() on top of it), you would have 10 chunks of data, each one ~5.8 GB, stored in the heap of the 3 JVM processes running as YARN containers on your cluster. If you don't have enough RAM, only some of them would be cached. If you don't have enough RAM to handle a single partition of 5.8 GB, you would get an out-of-memory error.
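
As a concrete illustration of that last step, a minimal sketch of caching the RDD and triggering the read with an action, using the test RDD from the question (with the default MEMORY_ONLY storage level, partitions that do not fit in memory are simply not cached and are recomputed when needed):

    // Materialize the partitions in executor memory and run a job over them
    test.cache()
    val pairs = test.count()
    println(s"number of (word, 1) pairs: $pairs")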