2 votes

intro

In the Apache Spark documentation I see that memory is divided into three regions, which can be configured using several parameters. Let's say we have an AWS EMR machine of type m4.xlarge. On this machine YARN's maximum allocation memory is 12288 MB. Using these configuration parameters:

  • spark.(executor|driver).memoryOverhead = 0.2
  • spark.(executor|driver).memory = 10g
  • spark.memory.fraction = 0.6 (the default)
  • spark.memory.storageFraction = 0.5 (the default)

I get:

  • memory overhead = 2G
  • executor memory = 10G
    • execution memory = 3G (spark.executor.memory * spark.memory.fraction * (1 - spark.memory.storageFraction))
    • storage memory = 3G (spark.executor.memory * spark.memory.fraction * spark.memory.storageFraction)
    • user memory = 4G (spark.executor.memory * (1 - spark.memory.fraction))

I'm using the same configuration for both driver and executor.
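
To make the arithmetic explicit, here is a minimal Scala sketch of the split as I understand it; the object and variable names are mine, and it ignores the small chunk of memory Spark reserves internally:

    // Back-of-the-envelope version of the breakdown above (values in GB).
    object MemoryBreakdown extends App {
      val executorMemory  = 10.0  // spark.executor.memory
      val overheadFactor  = 0.2   // the overhead fraction I configured
      val memoryFraction  = 0.6   // spark.memory.fraction (default)
      val storageFraction = 0.5   // spark.memory.storageFraction (default)

      val overhead  = executorMemory * overheadFactor        // ~2G, off-heap, requested from YARN
      val unified   = executorMemory * memoryFraction        // ~6G shared by execution and storage
      val storage   = unified * storageFraction              // ~3G
      val execution = unified * (1 - storageFraction)        // ~3G
      val user      = executorMemory * (1 - memoryFraction)  // ~4G

      println(f"overhead=$overhead%.1f storage=$storage%.1f execution=$execution%.1f user=$user%.1f")
      println(f"container size requested from YARN = ${executorMemory + overhead}%.1f G (max allocation: 12 G)")
    }

With these numbers the container I ask YARN for (10G heap + 2G overhead) sits right at the 12288 MB allocation limit, which is part of why I'm worried about headroom.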

First of all: is this calculation correct? Are these parameters sensible? I'm mainly wondering whether this will leave enough RAM on the machine so that, for example, the YARN daemons won't fail.

main question

What is exactly stored in these memory areas?

I'm wondering because I'm doing a fairly large collect (building a ~1.5 GB Map[Long, Long]), which I then intend to broadcast to all executors. When I did the collect without explicitly setting the overhead (the default being 0.1), the cluster failed: containers were killed by YARN for exceeding memory limits. With the overhead at 0.2 everything runs smoothly. It seems as though my Map is stored in the overhead, but what is the purpose of the executor storage memory then?
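
For context, this is roughly the pattern I'm using; it's a minimal sketch, and the input path and column names are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("collect-then-broadcast").getOrCreate()
    import spark.implicits._

    // Collect the (key, value) pairs onto the driver (~1.5 GB in my case) ...
    val lookup: Map[Long, Long] = spark.read.parquet("s3://my-bucket/pairs")  // made-up path
      .select($"key".cast("long"), $"value".cast("long"))
      .as[(Long, Long)]
      .collect()
      .toMap

    // ... and ship it to every executor as a broadcast variable.
    val lookupBc = spark.sparkContext.broadcast(lookup)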

Thanks in advance!

1
I wouldn't say it's a duplicate. Please read the main question: I'm surprised that the collect of a large piece of data succeeds after boosting the memory overhead. Besides, I'm talking about spark.executor.memoryOverhead, not spark.yarn.executor.memoryOverhead. I'll try to phrase my question better. - Chris Mejka

1 Answer

0 votes

The only thing I managed to identify, through trial and error, is that, for example, when collecting data to the driver, the memory overhead needs to be large enough to hold it, which suggests that the result of collect lands in the overhead.

Broadcast variables, however, need to fit in spark.executor.memory; the memoryOverhead doesn't seem to be affected by them.
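
In configuration terms, here is a sketch of how I would act on that observation; the sizes are illustrative, and in practice these would normally go on the spark-submit command line or in spark-defaults.conf, since driver and executor memory have to be fixed before the JVMs start:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuning-sketch")
      // a large collect() arrives on the driver, so give the driver-side overhead room
      .config("spark.driver.memory", "10g")
      .config("spark.driver.memoryOverhead", "2g")
      // a large broadcast has to fit inside the executor heap
      .config("spark.executor.memory", "10g")
      .config("spark.executor.memoryOverhead", "2g")
      .getOrCreate()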