
In Apache Spark,

repartition(n) - lets you repartition the RDD into exactly n partitions.
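For reference, a minimal sketch of that behaviour (the element placement is illustrative; repartition only fixes the number of partitions, not which elements or how many end up in each one):

    C = sc.parallelize(range(10), 2)
    C.repartition(3).glom().collect()
    # returns 3 partitions, but Spark decides how the 10 elements are spread across them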

But how do you partition a given RDD so that every partition (except possibly the last) has a specified number of elements, given that the number of elements in the RDD is not known and calling .count() is expensive?

    C = sc.parallelize([x for x in range(10)], 2)

Let's say internally, C = [[0,1,2,3,4,5], [6,7,8,9]].

    C = someCode(3)

Expected:

    C = [[0,1,2], [3,4,5], [6,7,8], [9]]

1 Answer


Quite easy in PySpark:

    C = sc.parallelize([x for x in range(10)], 2)
    # key each element by itself so partitionBy can use the value as the key
    rdd = C.map(lambda x: (x, x))
    # custom partition function: maps keys 0..9 into 4 groups of at most 3 elements
    # (int(x * 4 / 11) gives 0,1,2 -> 0; 3,4,5 -> 1; 6,7,8 -> 2; 9 -> 3)
    C_repartitioned = rdd.partitionBy(4, lambda x: int(x * 4 / 11)).map(lambda x: x[0]).glom().collect()
    C_repartitioned

    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

It is called custom partitioning. More on that: http://sparkdatasourceapi.blogspot.ru/2016/10/patitioning-in-spark-writing-custom.html

http://baahu.in/spark-custom-partitioner-java-example/
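If the elements are not conveniently the numbers 0..n-1, the same idea can be applied to positional indices instead of values. The sketch below is my own generalization (the helper name chunk_into_fixed_size is made up, not from the linked posts): it keys each element by its position via zipWithIndex (which itself triggers a Spark job to compute per-partition offsets, so it is not completely free) and then sends index // chunk_size to the partitioner. It still needs the total element count, or at least an upper bound, to size the number of partitions.

    def chunk_into_fixed_size(rdd, chunk_size, total_count):
        # ceil(total_count / chunk_size) partitions, computed with integer math
        num_parts = (total_count + chunk_size - 1) // chunk_size
        return (rdd.zipWithIndex()                          # (element, position)
                   .map(lambda kv: (kv[1], kv[0]))          # key by position
                   .partitionBy(num_parts, lambda idx: idx // chunk_size)
                   .map(lambda kv: kv[1]))                  # drop the index again

    C = sc.parallelize(range(10), 2)
    chunk_into_fixed_size(C, 3, 10).glom().collect()
    # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]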