I am new to Spark and I understand that Spark divides the executor memory into the following fractions:
RDD storage: the fraction Spark uses to store RDDs persisted with .persist() or .cache(); it is set with spark.storage.memoryFraction (default 0.6)
Shuffle and aggregation buffers: the fraction Spark uses to store shuffle outputs; it is set with spark.shuffle.memoryFraction (default 0.2). If shuffle output exceeds this fraction, Spark spills the data to disk
User code: the remaining fraction, which Spark leaves for executing arbitrary user code (default 0.2)
I am not mentioning the storage and shuffle safety fractions for simplicity.
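For reference, here is a minimal sketch of how I understand these fractions are set when creating the context (assuming a Spark version where these legacy properties apply; the app name is just a placeholder):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-fractions-example")       # placeholder app name
        .set("spark.storage.memoryFraction", "0.6")   # persisted/cached RDDs
        .set("spark.shuffle.memoryFraction", "0.2"))  # shuffle/aggregation buffers
sc = SparkContext(conf=conf)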
My question is: which memory fraction does Spark use to compute and transform RDDs that are not going to be persisted? For example:
from operator import add

lines = sc.textFile("i am a big file.txt")
count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
Here Spark will not load the whole file at once; it will partition the input file and apply all these transformations per partition in a single stage. However, which memory fraction will Spark use to load the partitioned lines and to compute flatMap() and map()?
Thanks
Update:
The code shown above is only a subset of the actual application: count is saved using saveAsTextFile, which triggers the RDD computation (see the sketch below). Moreover, my question is about Spark's behavior in general and not specific to the posted example.
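To make that concrete, a minimal sketch of the saving step (the output path is just a placeholder):

count.saveAsTextFile("output_path")  # action that triggers the computation of count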