6 votes

I am new to Spark and I understand that Spark divides the executor memory into the following fractions:

- RDD storage: which Spark uses to store persisted RDDs via .persist() or .cache(); controlled by spark.storage.memoryFraction (default 0.6)

- Shuffle and aggregation buffers: which Spark uses to store shuffle outputs; controlled by spark.shuffle.memoryFraction (default 0.2). If the shuffle output exceeds this fraction, Spark spills the data to disk.

- User code: the fraction Spark leaves for executing arbitrary user code (default 0.2)

I am not mentioning the storage and shuffle safety fractions for simplicity.
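
For reference, a minimal sketch of how these legacy (pre-1.6) fractions can be set on a SparkConf; the values shown are just the defaults mentioned above, and the app name is an arbitrary placeholder:

from pyspark import SparkConf, SparkContext

# Legacy memory fractions; values are the defaults mentioned above
conf = (SparkConf()
        .setAppName("memory-fractions-example")       # placeholder app name
        .set("spark.storage.memoryFraction", "0.6")   # RDD storage
        .set("spark.shuffle.memoryFraction", "0.2"))  # shuffle/aggregation buffers
sc = SparkContext(conf=conf)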

My question is: which memory fraction does Spark use to compute and transform RDDs that are not going to be persisted? For example:

from operator import add  # needed for reduceByKey(add)

lines = sc.textFile("i am a big file.txt")
count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

Here Spark will not load the whole file at once; it will partition the input file and apply all these transformations per partition in a single stage. However, which memory fraction will Spark use to load the partitioned lines and to compute flatMap() and map()?

Thanks

Update:
The code shown above is only a subset of the actual application; count is saved using saveAsTextFile, which triggers the RDD computation. Moreover, my question is about Spark's behavior in general, not specific to the posted example.
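
To make that concrete, the fuller pipeline looks roughly like this (a sketch; the output path is a placeholder):

from operator import add

lines = sc.textFile("i am a big file.txt")
count = (lines.flatMap(lambda x: x.split(' '))
              .map(lambda x: (x, 1))
              .reduceByKey(add))
count.saveAsTextFile("word_counts_output")  # the action that triggers the computation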

Good question. Can you run the experiment where you count the rows after the map with 0.0 storage memory? (I guess it will complete, but I don't know) - Alister Lee
@AlisterLee I actually ran an experiment with spark.storage.memoryFraction set to 0 and it worked just fine, which is the expected behavior since I am not caching anything here. Moreover, when I ran the application with a 0.6 storage memory fraction, it showed that I was using zero bytes of the storage fraction - Walid Baruni
So if it's not using storage or shuffle, then have you answered your own question? You could prove it by experiments which progressively starve user memory. - Alister Lee
@AlisterLee I am actually not sure if it is not using shuffle fraction. Though I can say that it is not using storage fraction. - Walid Baruni
If your stage doesn't require a shuffle, then it isn't using shuffle memory - reduceByKey will shuffle though and you'll need shuffle for that. - Alister Lee

2 Answers

0 votes

This is the answer I got from Andrew Or on Spark's mailing list:

It would be whatever's left in the JVM. This is not explicitly controlled by a fraction like storage or shuffle. However, the computation usually doesn't need to use that much space. In my experience it's almost always the caching or the aggregation during shuffles that's the most memory intensive.
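
To put rough numbers on "whatever's left", here is a back-of-the-envelope under the legacy model described in the question (the 4 GB heap is an illustrative assumption, and the safety fractions are ignored as in the question):

heap_mb = 4 * 1024                                # illustrative 4 GB executor heap
storage_mb = heap_mb * 0.6                        # spark.storage.memoryFraction
shuffle_mb = heap_mb * 0.2                        # spark.shuffle.memoryFraction
leftover_mb = heap_mb - storage_mb - shuffle_mb   # "whatever's left" for computation / user code
print(storage_mb, shuffle_mb, leftover_mb)        # ~2458, ~819, ~819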

0 votes

From the official Spark tuning guide: http://spark.apache.org/docs/latest/tuning.html#memory-management-overview

From the link above:

- spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.75). The rest of the space (25%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
- spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
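
A worked example of those two settings (the heap size is an illustrative assumption; the fractions are the defaults quoted above):

heap_mb = 4096                # illustrative executor heap
usable_mb = heap_mb - 300     # 300 MB is reserved
M = usable_mb * 0.75          # spark.memory.fraction: execution + storage
R = M * 0.5                   # spark.memory.storageFraction: storage immune to eviction
print(M, R)                   # 2847.0, 1423.5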