I am learning Spark, and when I tested the repartition() function in the pyspark shell with the expressions below, I observed a very strange result: all elements fall into the same partition after repartition().
Here I used glom() to inspect the partitioning within the RDD. I was expecting repartition() to shuffle the elements and distribute them randomly among the partitions. The collapse into a single partition only happens when the new number of partitions is <= the original number. When I set the new number of partitions > the original number, the elements do get spread across partitions, but there is still no shuffling: each original partition appears to move around as a whole chunk. Am I doing anything wrong here?
In [1]: sc.parallelize(range(20), 8).glom().collect()
Out[1]:
[[0, 1],
[2, 3],
[4, 5],
[6, 7, 8, 9],
[10, 11],
[12, 13],
[14, 15],
[16, 17, 18, 19]]
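As a sanity check, the per-partition sizes can also be counted without collecting every element to the driver; a minimal sketch, assuming the same SparkContext sc as above:

# Count elements per partition on the executors instead of
# collecting everything (an alternative to glom().collect()).
rdd = sc.parallelize(range(20), 8)
print(rdd.mapPartitions(lambda part: [sum(1 for _ in part)]).collect())
# expected along the lines of [2, 2, 2, 4, 2, 2, 2, 4], matching Out[1]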
In [2]: sc.parallelize(range(20), 8).repartition(8).glom().collect()
Out[2]:
[[],
[],
[],
[],
[],
[],
[2, 3, 6, 7, 8, 9, 14, 15, 16, 17, 18, 19, 0, 1, 12, 13, 4, 5, 10, 11],
[]]
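To be explicit about what I expected repartition() to do, here is a rough sketch that emulates a random redistribution using a random key and partitionBy() (a hypothetical illustration, not something I ran above):

import random

# Assign each element a random partition key, partition by that key,
# then drop the key again; elements end up randomly spread over 8 partitions.
rdd = sc.parallelize(range(20), 8)
shuffled = (rdd.map(lambda x: (random.randrange(8), x))
               .partitionBy(8)
               .values())
print(shuffled.glom().collect())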
In [3]: sc.parallelize(range(20), 8).repartition(10).glom().collect()
Out[3]:
[[],
[0, 1],
[14, 15],
[10, 11],
[],
[6, 7, 8, 9],
[2, 3],
[16, 17, 18, 19],
[12, 13],
[4, 5]]
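From Out[1] and Out[3] it looks like repartition(10) relocated the original partitions as whole chunks rather than mixing individual elements; a quick check (sketch, same assumptions as above):

# Compare the sets of non-empty partition contents before and after;
# equality means whole chunks were moved, not individual elements.
before = sc.parallelize(range(20), 8).glom().collect()
after = sc.parallelize(range(20), 8).repartition(10).glom().collect()
print({tuple(sorted(p)) for p in before if p}
      == {tuple(sorted(p)) for p in after if p})
# True for the partition contents shown in Out[1] and Out[3]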
I am using Spark version 2.1.1.