2 votes

intro

In the Apache Spark documentation I see that memory is divided into three regions, which can be configured using several parameters. Let's say we have an AWS EMR machine of type m4.xlarge. On this machine YARN's maximum allocation memory is 12288 MB. Using these configuration parameters:

  • spark.(executor|driver).memoryOverhead = 0.2
  • spark.(executor|driver).memory = 10g
  • spark.memory.fraction = 0.6 (the default)
  • spark.memory.storageFraction = 0.5 (the default)

I get:

  • memory overhead = 2G
  • executor memory = 10G
    • execution memory = 3G (spark.executor.memory * spark.memory.fraction * (1 - spark.memory.storageFraction))
    • storage memory = 3G (spark.executor.memory * spark.memory.fraction * spark.memory.storageFraction)
    • user memory = 4G (spark.executor.memory * (1 - spark.memory.fraction))

I'm using the same configuration for both driver and executor.
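
To make the arithmetic explicit, here is a minimal Scala sketch of the split as I understand it; the object and variable names are mine, and it ignores the small chunk of memory Spark reserves internally:

    // Back-of-the-envelope version of the breakdown above (values in GB).
    object MemoryBreakdown extends App {
      val executorMemory  = 10.0  // spark.executor.memory
      val overheadFactor  = 0.2   // the overhead fraction I configured
      val memoryFraction  = 0.6   // spark.memory.fraction (default)
      val storageFraction = 0.5   // spark.memory.storageFraction (default)

      val overhead  = executorMemory * overheadFactor        // ~2G, off-heap, requested from YARN
      val unified   = executorMemory * memoryFraction        // ~6G shared by execution and storage
      val storage   = unified * storageFraction              // ~3G
      val execution = unified * (1 - storageFraction)        // ~3G
      val user      = executorMemory * (1 - memoryFraction)  // ~4G

      println(f"overhead=$overhead%.1f storage=$storage%.1f execution=$execution%.1f user=$user%.1f")
      println(f"container size requested from YARN = ${executorMemory + overhead}%.1f G (max allocation: 12 G)")
    }

With these numbers the container I ask YARN for (10G heap + 2G overhead) sits right at the 12288 MB allocation limit, which is part of why I'm worried about headroom.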

First of all: is this calculation correct? Are these parameters sensible? I'm mainly wondering whether this will leave enough RAM on the machine so that, for example, the YARN daemons won't fail.

main question

What is exactly stored in these memory areas?

I'm wondering because I'm doing a fairly large collect (building a ~1.5 GB Map[Long, Long]), which I then intend to broadcast to all executors. When I did the collect without explicitly setting the overhead (the default being 0.1), the cluster failed: containers were killed by YARN for exceeding memory limits. With the overhead at 0.2 everything runs smoothly. It seems as though my Map is stored in the overhead, but what is the purpose of the executor storage memory then?
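
For context, this is roughly the pattern I'm using; it's a minimal sketch, and the input path and column names are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("collect-then-broadcast").getOrCreate()
    import spark.implicits._

    // Collect the (key, value) pairs onto the driver (~1.5 GB in my case) ...
    val lookup: Map[Long, Long] = spark.read.parquet("s3://my-bucket/pairs")  // made-up path
      .select($"key".cast("long"), $"value".cast("long"))
      .as[(Long, Long)]
      .collect()
      .toMap

    // ... and ship it to every executor as a broadcast variable.
    val lookupBc = spark.sparkContext.broadcast(lookup)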

Thanks in advance!

1
I wouldn't say it's a duplicate. Please read the main question: I'm surprised that the collect of a large piece of data succeeds after boosting the memory overhead. Besides, I'm talking about spark.executor.memoryOverhead, not spark.yarn.executor.memoryOverhead. I'll try to phrase my question better. - Chris Mejka

1 Answer

0 votes

The only thing I managed to identify, through trial and error, is that, for example, when collecting data to the driver, the memory overhead needs to be large enough to hold it, which suggests that the result of collect lands in the overhead.

Broadcast variables, however, need to fit in spark.executor.memory; the memoryOverhead doesn't seem to be affected by them.
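
In configuration terms, here is a sketch of how I would act on that observation; the sizes are illustrative, and in practice these would normally go on the spark-submit command line or in spark-defaults.conf, since driver and executor memory have to be fixed before the JVMs start:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuning-sketch")
      // a large collect() arrives on the driver, so give the driver-side overhead room
      .config("spark.driver.memory", "10g")
      .config("spark.driver.memoryOverhead", "2g")
      // a large broadcast has to fit inside the executor heap
      .config("spark.executor.memory", "10g")
      .config("spark.executor.memoryOverhead", "2g")
      .getOrCreate()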