
I am trying to understand Spark internals.

  • I understood that partitions represent how the data is physically distributed across the cluster of machines.
  • Number of tasks = number of partitions.
  • Number of cores * number of executors = the maximum level of parallelism that can be achieved.
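
For example, this is roughly how I have been checking those numbers (a minimal Scala sketch; the HDFS path is just a placeholder for my file):

    import org.apache.spark.sql.SparkSession

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partition-check")
          .getOrCreate()

        // Roughly one partition per HDFS block, so ~78 for 10 GB / 128 MB.
        val rdd = spark.sparkContext.textFile("hdfs:///data/file_10gb.txt")
        println(s"Input partitions: ${rdd.getNumPartitions}")

        // Upper bound on concurrently running tasks = total executor cores.
        println(s"Default parallelism: ${spark.sparkContext.defaultParallelism}")

        spark.stop()
      }
    }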

My questions are:

  1. I have a 10 GB file stored in HDFS with a 128 MB block size, so the total number of partitions would be ~78. How many partitions are loaded per executor? If we consider a cluster of 4 nodes with 4 cores each, does that mean 4 partitions are loaded into a single executor so that 4 tasks can run in parallel? How are partitions distributed across the cluster, and how many tasks are created?

  2. Consider a cluster of 3 nodes, each with 4 GB of memory. If one node is reserved for the driver, two nodes are left for executors. I have to join two datasets of 10 GB each. Will Spark throw an out-of-memory error, since only 8 GB of memory is available in the cluster while a total of 20 GB of data needs to be loaded?


1 Answer


So, you have ~78 partitions (HDFS blocks) of 128 MB each. Let's go through the first case:

4 nodes with 4 cores each - when you start the job you will have 16 cores reading 16 partitions at the same time. As each partition finishes, the executor takes the next one. If your job is a simple read, transform and write, it will do this for each partition and then step to the next, without storing anything in memory or on disk. Your maximum parallelism will be 16, so in the best case you process 16 partitions at a time, assuming they all finish at the same time.
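
As a rough sketch of that first case (the property values and paths below are illustrative assumptions, and in practice these settings are often passed via spark-submit instead):

    import org.apache.spark.sql.SparkSession

    object SimpleReadTransformWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("read-transform-write")
          .config("spark.executor.instances", "4") // 4 nodes -> 4 executors
          .config("spark.executor.cores", "4")     // 4 cores each -> up to 16 parallel tasks
          .getOrCreate()

        // ~78 input partitions processed 16 at a time; nothing is cached, so each
        // partition is read, transformed and written independently.
        spark.read.text("hdfs:///data/file_10gb.txt")
          .filter("value != ''")
          .write.mode("overwrite").text("hdfs:///data/output")

        spark.stop()
      }
    }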

3 nodes with 4 GB each, for a join of two 10 GB datasets - this case is where the magic happens with Spark. Spark tries to ensure that, no matter how big your dataset is, it will still try to process it. First, not all of the data has to stay in memory: if the data is too big, Apache Spark will spill it to disk. Does the shuffle happen in memory? Yes, but it happens per 128 MB partition; the shuffle never happens with the whole 10 GB dataset at once. So even if you don't have 10 GB of memory, as long as you don't force a cache in memory, Spark will spill the data to the workers' disks and handle the join with the metadata that is kept in memory. But if you have two 10 GB files that are not split into 128 MB blocks, then you can face the problem of Spark trying to load everything at once.
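
A minimal sketch of that second case (the paths, the join key "id" and the shuffle-partition value are assumptions, not something from the question):

    import org.apache.spark.sql.SparkSession

    object ConstrainedJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("constrained-join")
          // More shuffle partitions -> smaller per-task shuffle blocks that fit in the
          // limited executor memory; whatever doesn't fit is spilled to disk.
          .config("spark.sql.shuffle.partitions", "400")
          .getOrCreate()

        val a = spark.read.parquet("hdfs:///data/dataset_a") // ~10 GB
        val b = spark.read.parquet("hdfs:///data/dataset_b") // ~10 GB

        // No .cache()/.persist(), so Spark is free to spill shuffle data to disk
        // and the join proceeds partition by partition.
        a.join(b, Seq("id"))
          .write.mode("overwrite").parquet("hdfs:///data/joined")

        spark.stop()
      }
    }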

For more information, here you can find an amazing presentation explaining the differences between the Spark join strategies, which I didn't discuss in my explanation but which can help you understand better how this works.