I am trying to understand Spark internals.
- I understand that partitions represent how the data is physically distributed across the cluster of machines.
- Number of tasks = number of partitions (for a given stage).
- Number of cores per executor * number of executors = maximum level of parallelism that can be achieved (see the sketch after this list for how I check these numbers).
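For context, this is roughly how I have been inspecting those numbers from a spark-shell (the HDFS path is just a placeholder, not my real data):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming the cluster is provided by spark-shell/spark-submit;
// the HDFS path below is only a placeholder.
val spark = SparkSession.builder()
  .appName("partition-inspection")
  .getOrCreate()

// Maximum parallelism the scheduler can use at once
// (roughly executor cores * number of executors).
println(spark.sparkContext.defaultParallelism)

// Number of partitions Spark created for the file, which is also
// the number of tasks in the stage that reads it.
val df = spark.read.text("hdfs:///data/input")
println(df.rdd.getNumPartitions)
```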
My questions are:
I have a 10 GB file stored in HDFS with a 128 MB block size, so the total number of partitions would be ~80 (10240 MB / 128 MB). How many partitions are loaded per executor? If I consider a cluster of 4 nodes, each node having 4 cores, does that mean 4 partitions are loaded into a single executor, so that 4 tasks can run in parallel? How are the partitions distributed across the cluster, and how many tasks are created? (A sketch of the configuration I have in mind follows.)
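Here is a sketch of the setup this question is about (the config values and the path are assumptions for illustration, not my production job):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the scenario: 4 executors with 4 cores each.
// These config values and the HDFS path are assumptions for illustration.
val spark = SparkSession.builder()
  .appName("partition-distribution")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "4")
  .getOrCreate()

// ~80 partitions for a 10 GB file split into 128 MB blocks,
// so the read stage should launch ~80 tasks in total.
val df = spark.read.text("hdfs:///data/10gb_file")
println(df.rdd.getNumPartitions)

// 4 executors * 4 cores = 16 task slots, so at most 16 of those
// tasks run at the same time; the remaining tasks wait in the queue.
```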
Consider a cluster of 3 nodes, each node having 4 GB of memory. If one node is reserved for the driver, then two nodes are left for executors. I have to join two datasets of 10 GB each. Does Spark throw an out-of-memory error, given that only 8 GB of memory is available in the cluster while a total of 20 GB of data needs to be loaded? (The join I have in mind is sketched below.)
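For completeness, this is roughly the kind of join I mean (the paths, the join key "id", and the per-executor memory are made up for illustration, based on the 4 GB nodes):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the join in question. Paths, the join key "id", and the
// 2 GB per-executor memory are assumptions based on the 4 GB nodes.
val spark = SparkSession.builder()
  .appName("join-memory-question")
  .config("spark.executor.instances", "2")
  .config("spark.executor.memory", "2g")
  .getOrCreate()

val left  = spark.read.parquet("hdfs:///data/set_a")   // ~10 GB
val right = spark.read.parquet("hdfs:///data/set_b")   // ~10 GB

// My understanding is that a shuffle (sort-merge) join works partition by
// partition and can spill to disk, so the full 20 GB would not have to be
// held in memory at once; this is what I want to confirm.
val joined = left.join(right, "id")
joined.write.parquet("hdfs:///data/joined")
```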