6 votes

I am trying to find out the best way to configure the memory on the nodes of my cluster. However, I believe that to do that there are some things I need to understand better, such as how Spark handles memory across tasks.

For example, let's say I have 3 executors, and each executor can run up to 8 tasks in parallel (i.e. 8 cores). If I have an RDD with 24 partitions, this means that theoretically all partitions can be processed in parallel. However, if we zoom into one executor, this assumes that each task can have its partition in memory to operate on it. If not, then 8 tasks in parallel will not happen and some scheduling will be required.

Hence I concluded that if one seeks true parallelism, having some idea of the size of your partitions would be helpful, as it would inform how to size your executors.
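For concreteness, here is roughly the setup I have in mind (the application name, input path, and partition count are just placeholders), submitted with something like --num-executors 3 --executor-cores 8:

    import org.apache.spark.sql.SparkSession

    // Hypothetical job: 3 executors with 8 cores each, and an RDD with 24 partitions.
    val spark = SparkSession.builder().appName("partition-sizing-question").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("hdfs:///path/to/input").repartition(24)
    println(rdd.getNumPartitions) // 24 partitions -> at most 3 * 8 = 24 tasks in flight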

  • Q0 - I simply want to understand a little better what happens when not all the partitions can fit in memory within one executor. Are some spilled to disk while others are in memory for operations? Does Spark reserve the memory for each task and, if it detects that there is not enough, schedule the tasks? Or does one simply run into an out-of-memory error?

  • Q1 - Does true parallelism within an executor also depend on the amount of memory available on the executor? In other words, I may have 8 cores in my cluster, but if I do not have enough memory to load 8 partitions of my data at once, then I won't be fully parallel.

As a final note, I have seen the following statement several times, but I find it a little confusing:

“Increasing the number of partitions can also help reduce out-of-memory errors, since it means that Spark will operate on a smaller subset of the data for each executor.”

How does that work exactly? I mean, Spark may work on a smaller subset, but if the total set of partitions can't fit in memory anyway, what happens?


1 Answer

8 votes

Why should I increase the number of tasks (partitions)?

I would like to first answer the last question, the one that is confusing you. Here is a quote from another question:

Spark does not need to load everything in memory to be able to process it. This is because Spark will partition the data into smaller blocks and operate on these separately.

In fact, by default Spark tries to split input data automatically into some optimal number of partitions:

Spark automatically sets the number of “map” tasks to run on each file according to its size

One can specify the number of partitions for the operation being performed (e.g. for cogroup: def cogroup[W](other: RDD[(K, W)], numPartitions: Int)), and one can also call .repartition() after any RDD transformation.
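For example, a minimal sketch (assuming an existing SparkContext sc; the data itself is just a placeholder):

    // Two small pair RDDs, for illustration only.
    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    // Explicitly request 48 partitions for the shuffle produced by cogroup.
    val grouped = left.cogroup(right, 48)

    // Or change the partitioning of any RDD after the fact.
    val repartitioned = grouped.repartition(96)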

Moreover, later in the same paragraph of the documentation they say:

In general, we recommend 2-3 tasks per CPU core in your cluster.

In summary:

  1. the default number of partitions is a good start;
  2. 2-3 partitions per CPU core are generally recommended (see the sketch below).
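A rough sketch of how the second rule can be applied (assuming an existing SparkContext sc; sc.defaultParallelism usually reflects the total number of cores available to the application):

    val totalCores = sc.defaultParallelism   // e.g. 3 executors * 8 cores = 24
    val targetPartitions = totalCores * 3    // aim for ~2-3 partitions per core

    val data = sc.parallelize(1 to 1000000).repartition(targetPartitions)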

How does Spark deal with inputs that do not fit in memory?

In short, by partitioning the input and intermediate results (RDDs). Usually each small chunk fits in the memory available to the executor and is processed quickly.

Spark is capable of caching the RDDs it has computed. By default, every time an RDD is reused it will be recomputed (it is not cached); calling .cache() or .persist() keeps the already-computed result in memory or on disk.
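A minimal sketch (the input path is a placeholder, and sc is an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Without caching, `parsed` would be recomputed by every action that uses it.
    val parsed = sc.textFile("hdfs:///path/to/input").map(_.split(","))

    parsed.cache()                                  // same as persist(StorageLevel.MEMORY_ONLY)
    // parsed.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that do not fit to disk

    parsed.count() // first action computes and caches the RDD
    parsed.count() // second action reuses the cached partitions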

Internally, each executor has a memory pool that floats between execution and storage (see here for more details). When there is not enough memory for task execution, Spark first tries to evict some of the storage cache and then spills task data to disk. See these slides for further details. The balance between execution and storage memory is well described in this blog post, which also has a nice illustration:

[Figure: Spark memory allocation]
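The split is controlled by a few settings; here is a minimal sketch with the documented defaults (the 8g heap is just an example value):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")          // total heap per executor (example value)
      .set("spark.memory.fraction", "0.6")         // share of (heap - 300 MB) used for execution + storage
      .set("spark.memory.storageFraction", "0.5")  // part of that pool protected from eviction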

OutOfMemoryError often happens not directly because of large input data, but because of poor partitioning and hence large auxiliary data structures, like the HashMap on the reduce side (here the documentation again advises having more partitions than executors). So no, OutOfMemoryError will not happen just because of big input; processing may be very slow, though, since Spark will have to write to and read from disk. The documentation also suggests that tasks as short as 200 ms (in running time) are fine for Spark.
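As a hedged illustration of that advice (the input path and the key extraction are placeholders, and sc is an existing SparkContext), giving a wide operation more reduce partitions keeps each reducer's in-memory structures small:

    // Illustrative pair RDD.
    val pairs = sc.textFile("hdfs:///path/to/logs").map(line => (line.take(4), 1L))

    // Ask for 200 reduce partitions instead of the default, so each task's
    // per-key aggregation structures hold a smaller slice of the data.
    val counts = pairs.reduceByKey(_ + _, 200)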

Outline: split your data properly: more than one partition per core, and a running time of at least ~200 ms per task. The default partitioning is a good starting point; tweak the parameters manually from there.

(To find the optimal number of partitions, I would suggest experimenting with, e.g., a 1/8 subset of the input data on 1/8 of the cluster.)

Do tasks within same executor affect each other?

Short answer: they do. For more details, check out the slides I mentioned above (starting from slide #32).

All N tasks share the available memory, each getting roughly an N-th of it, hence they affect each other's "parallelism". If I interpret your idea of true parallelism correctly as "full utilization of CPU resources", then yes: a small memory pool will result in data spilling to disk and the computations becoming IO-bound rather than CPU-bound.
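As a back-of-the-envelope sketch (the 8 GB heap and 8 concurrent tasks are assumptions; the 300 MB reserved memory and the 0.6 fraction are the documented defaults of the unified memory manager):

    // Rough estimate of the execution memory share per task.
    val executorHeap    = 8L * 1024 * 1024 * 1024                  // assumed: 8 GB executor heap
    val reserved        = 300L * 1024 * 1024                       // reserved memory
    val unifiedPool     = ((executorHeap - reserved) * 0.6).toLong // spark.memory.fraction = 0.6
    val concurrentTasks = 8                                        // assumed: 8 cores -> 8 tasks

    // Each running task is guaranteed roughly 1/(2N) of the pool and can grow up to 1/N.
    val minPerTask = unifiedPool / (2 * concurrentTasks)           // ~296 MB
    val maxPerTask = unifiedPool / concurrentTasks                 // ~592 MB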

Further reading

I would highly recommend the entire Tuning Spark chapter, and the Spark Programming Guide in general. See also this blog post on Spark Memory Management by Alexey Grishchenko.