
So, in Spark, when an application is started, an RDD containing the application's dataset (e.g. the words dataset for WordCount) is created.

So far, my understanding is that an RDD is a collection of those words (in WordCount) together with the operations that have been applied to that dataset (e.g. map, reduceByKey, etc.).

However, AFAIK, Spark also has HadoopPartitions (or, in general, partitions), which each executor reads from HDFS. And I believe that the RDD on the driver also contains all of these partitions.

So, what exactly gets divided among executors in Spark? Does every executor get its sub-dataset as a single RDD that contains less data than the RDD on the driver, or does every executor only deal with these partitions and read them directly from HDFS? Also, when are the partitions created? On RDD creation?


1 Answer


Custom partitioning (via a Partitioner) is configurable provided the RDD is key-value based (a pair RDD); the number of partitions can be set for any RDD.

There are three main properties of partitions:

  1. Tuples in the same partition are guaranteed to be on the same machine.
  2. Each node in a cluster can hold more than one partition.
  3. The total number of partitions is configurable; by default it is set to the total number of cores across all executor nodes (see the sketch after this list).
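A minimal Scala sketch of checking and overriding the partition count; the `local[4]` master and the sample words are assumptions for illustration only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    // local[4] gives 4 local cores, so the default parallelism is 4 (an assumption for this demo).
    val sc = new SparkContext(new SparkConf().setAppName("partition-count").setMaster("local[4]"))

    // Default parallelism follows the total number of executor cores,
    // unless spark.default.parallelism overrides it.
    println(sc.defaultParallelism) // 4

    // The partition count can also be requested explicitly at RDD creation time.
    val words = sc.parallelize(Seq("spark", "hadoop", "hdfs", "rdd"), numSlices = 8)
    println(words.getNumPartitions) // 8

    sc.stop()
  }
}
```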

Spark supports two types of partitioning:

  1. Hash Partitioning
  2. Range Partitioning
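Both kinds can be applied explicitly to a key-value RDD via partitionBy. A minimal sketch, assuming an existing SparkContext `sc` and illustrative word-count pairs:

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

// Sample pair RDD; the data is purely illustrative.
val counts = sc.parallelize(Seq(("spark", 3), ("hadoop", 1), ("rdd", 5), ("hdfs", 2)))

// Hash partitioning: a key lands in the partition given by its hash code modulo numPartitions.
val hashed = counts.partitionBy(new HashPartitioner(4))

// Range partitioning: keys are split into sorted, roughly equal-sized ranges
// (determined by sampling the RDD), so it requires an ordering on the keys.
val ranged = counts.partitionBy(new RangePartitioner(4, counts))

println(hashed.getNumPartitions) // 4
println(ranged.getNumPartitions) // 4
```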

When Spark reads a file from HDFS, it creates a single partition for each input split. The input split is determined by the Hadoop InputFormat used to read the file. When you call rdd.repartition(x), it performs a shuffle of the data from the N partitions you currently have in rdd to the x partitions you want, distributing the data on a round-robin basis.
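A minimal sketch of this, assuming an existing SparkContext `sc`; the HDFS path is a placeholder:

```scala
// Reading a text file: one partition per Hadoop input split (typically one per HDFS block).
val rdd = sc.textFile("hdfs:///user/data/words.txt")
println(rdd.getNumPartitions)

// Shuffle the data from the current N partitions into 10 partitions.
val repartitioned = rdd.repartition(10)
println(repartitioned.getNumPartitions) // 10
```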
