intro
In the Apache Spark documentation I see that the memory is divided into three groups, which can be configured using several parameters. Let's say we have an AWS EMR m4.xlarge instance. On this machine YARN has a maximum allocation memory of 12288 MB. Using these configuration parameters:
- spark.(executor|driver).memoryOverhead = 0.2
- spark.(executor|driver).memory = 10g
- spark.memory.fraction = 0.6 (the default)
- spark.memory.storageFraction = 0.5 (the default)
I get:
- memory overhead = 2G
- executor memory = 10G
- execution memory = 3G (spark.executor.memory * spark.memory.fraction * (1 - spark.memory.storageFraction))
- storage memory = 3G (spark.executor.memory * spark.memory.fraction * spark.memory.storageFraction)
- user memory = 4G (spark.executor.memory * (1 - spark.memory.fraction))
I'm using the same configuration for both driver and executor.
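To make the arithmetic explicit, here is a small Scala sketch of how I'm computing these numbers (the object and variable names are just for illustration; I'm treating the overhead setting as a fraction of executor memory, and I'm ignoring the ~300 MB that Spark reserves internally, so the figures are approximate):

```scala
// Back-of-the-envelope memory arithmetic for the settings above.
// Names are illustrative; Spark's ~300 MB reserved memory is ignored,
// so these figures are approximate.
object MemoryCalc extends App {
  val yarnMaxAllocMb   = 12288   // YARN maximum allocation on an m4.xlarge
  val executorMemGb    = 10.0    // spark.executor.memory
  val overheadFraction = 0.2     // my overhead setting, as a fraction of executor memory
  val memoryFraction   = 0.6     // spark.memory.fraction (default)
  val storageFraction  = 0.5     // spark.memory.storageFraction (default)

  val overheadGb  = executorMemGb * overheadFraction      // 2.0 GB
  val unifiedGb   = executorMemGb * memoryFraction        // 6.0 GB shared by execution + storage
  val storageGb   = unifiedGb * storageFraction           // 3.0 GB
  val executionGb = unifiedGb * (1 - storageFraction)     // 3.0 GB
  val userGb      = executorMemGb * (1 - memoryFraction)  // 4.0 GB

  val containerGb = executorMemGb + overheadGb            // 12.0 GB requested from YARN
  println(f"container: $containerGb%.1f GB, YARN max: ${yarnMaxAllocMb / 1024.0}%.1f GB")
  println(f"execution: $executionGb%.1f GB, storage: $storageGb%.1f GB, user: $userGb%.1f GB")
}
```

With these numbers the container request (memory plus overhead) lands exactly on the 12288 MB maximum allocation.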
First of all: is this calculation correct? Are these parameters reasonable? I'm mainly wondering whether this will leave enough RAM on the machine so that, for example, the YARN daemons won't fail.
main question
What exactly is stored in these memory areas?
I'm wondering because I'm doing a fairly large collect (building a ~1.5 GB Map[(Long, Long)]), which I then intend to broadcast to all executors. When I ran the collect without explicitly setting the overhead (the default being 0.1), the cluster failed: containers were killed by YARN for exceeding memory limits. With the overhead at 0.2, everything runs smoothly. It seems as though my Map is stored in the overhead, but what is the purpose of the executor storage memory then?
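For context, here is a stripped-down sketch of the job (the table and column names are placeholders I made up, and I've simplified the map's type to Map[Long, Long]):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-lookup").getOrCreate()

// "pairs" stands in for however the real job produces the (key, value) rows;
// here it's just a hypothetical table with two long columns.
val pairs = spark.table("pairs")

// Collect the ~1.5 GB lookup map onto the driver...
val lookup: Map[Long, Long] = pairs.rdd
  .map(r => r.getLong(0) -> r.getLong(1))
  .collectAsMap()
  .toMap

// ...and broadcast it so every executor gets one read-only copy.
val lookupBc = spark.sparkContext.broadcast(lookup)

// Tasks then read it via lookupBc.value.
```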
Thanks in advance!
To clarify: I mean spark.executor.memoryOverhead, not spark.yarn.executor.memoryOverhead. I'll try to phrase my question better. - Chris Mejka