Suppose I create such an RDD (I am using Pyspark):
list_rdd = sc.parallelize(xrange(0, 20, 2), 6)
then I print the partitioned elements with the glom()
method and obtain
[[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]
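(To be precise, the output above is what I get from the usual glom-then-collect combination:

list_rdd.glom().collect()
)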
How did Spark decide how to partition my list? Where does that specific choice of elements come from? It could have grouped them differently, leaving elements other than 0 and 10 on their own, and still produced the 6 requested partitions. On a second run, the partitions are the same.
Using a larger range with 15 elements, I get partitions in a repeating pattern of two elements followed by three elements:
list_rdd = sc.parallelize(xrange(0, 30, 2), 6)
[[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]
Using a smaller range with only 5 elements, I get
list_rdd = sc.parallelize(xrange(0, 10, 2), 6)
[[], [0], [2], [4], [6], [8]]
So what I infer is that Spark splits the list into a repeating pattern in which the smallest possible partition is followed by larger ones.
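To make that inference concrete, here is a minimal sketch that reproduces all three outputs above, assuming (and this is purely my guess, not something I have checked in the Spark source) that the boundary of partition i is placed at index i * n // numSlices:

def guessed_slices(data, num_slices):
    # Purely my assumption: partition i covers indices
    # [i * n // num_slices, (i + 1) * n // num_slices)
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

print(guessed_slices(list(xrange(0, 20, 2)), 6))
# [[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]
print(guessed_slices(list(xrange(0, 30, 2)), 6))
# [[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]
print(guessed_slices(list(xrange(0, 10, 2)), 6))
# [[], [0], [2], [4], [6], [8]]

This reproduces the three outputs exactly, which is why I suspect the choice is deliberate rather than accidental.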
My question is whether there is a reason behind this choice. It is very elegant, but does it also provide performance advantages?